Peek Tool

Get a quick overview and metadata for a document without running full processing.

Overview

The docsray_peek tool provides instant document insights without full processing:

Document metadata (title, author, page count, file size)
Format detection and compatibility check
Quick preview of document content
Provider capabilities for the document type
Processing estimates for different operations

Basic Usage

Quick Document Check

# Get basic metadata
result = docsray.peek("document.pdf")

print(f"Title: {result['metadata']['title']}")
print(f"Pages: {result['metadata']['page_count']}")  
print(f"Size: {result['metadata']['file_size']} bytes")

With Depth Control

# Metadata only (fastest)
result = docsray.peek("document.pdf", depth="metadata")

# Include document structure
result = docsray.peek("document.pdf", depth="structure") 

# Include content preview
result = docsray.peek("document.pdf", depth="preview")

Parameters

document_url (required)

Path or URL to the document to analyze.

# Local files
docsray.peek("./reports/quarterly.pdf")
docsray.peek("/home/user/documents/contract.pdf")

# URLs
docsray.peek("https://example.com/document.pdf")

depth (optional)

Level of analysis detail. Default: "structure"

"metadata" - Basic file information only
"structure" - Metadata + document structure
"preview" - Metadata + structure + content preview

# Fastest - metadata only
docsray.peek("doc.pdf", depth="metadata")

# Balanced - includes structure
docsray.peek("doc.pdf", depth="structure")

# Most comprehensive - includes preview
docsray.peek("doc.pdf", depth="preview")

provider (optional)

Provider to use for analysis. Default: "auto"

# Let Docsray choose the best provider
docsray.peek("document.pdf")  # Uses auto-selection

# Force specific provider
docsray.peek("document.pdf", provider="pymupdf4llm")  # Fast
docsray.peek("document.pdf", provider="llama-parse")  # Comprehensive

Response Structure

The peek tool returns a structured response with different levels of detail:

Metadata Response (`depth="metadata"`)

{
  "metadata": {
    "title": "Quarterly Financial Report",
    "author": "Finance Department", 
    "subject": "Q3 2023 Results",
    "creator": "Microsoft Word",
    "creation_date": "2023-10-15T10:30:00Z",
    "modification_date": "2023-10-16T14:20:00Z",
    "page_count": 25,
    "format": "pdf",
    "file_size": 2048576,
    "has_images": true,
    "has_tables": true,
    "is_encrypted": false,
    "language": "en"
  },
  "provider": "pymupdf4llm"
}
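
The timestamps in this response are ISO 8601 strings, so they usually need parsing before they can be compared. The following is a minimal sketch that assumes the field names shown in the sample above; `parse_timestamp` and `summarize_metadata` are hypothetical helpers, not part of Docsray.

```python
from datetime import datetime

def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as '2023-10-15T10:30:00Z'."""
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def summarize_metadata(metadata):
    """Condense a peek metadata dict into a few display fields (hypothetical helper)."""
    created = parse_timestamp(metadata["creation_date"])
    modified = parse_timestamp(metadata["modification_date"])
    return {
        "title": metadata.get("title") or "Untitled",
        "pages": metadata["page_count"],
        "size_mib": round(metadata["file_size"] / (1024 * 1024), 2),
        "edited_after_creation": modified > created,
    }

# Values copied from the sample metadata response
sample = {
    "title": "Quarterly Financial Report",
    "creation_date": "2023-10-15T10:30:00Z",
    "modification_date": "2023-10-16T14:20:00Z",
    "page_count": 25,
    "file_size": 2048576,
}
summary = summarize_metadata(sample)
print(summary)
```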

Structure Response (`depth="structure"`)

{
  "metadata": { /* ... metadata fields ... */ },
  "structure": {
    "outline": [
      {"title": "Executive Summary", "page": 1, "level": 1},
      {"title": "Financial Highlights", "page": 3, "level": 1},
      {"title": "Revenue Analysis", "page": 5, "level": 2},
      {"title": "Cost Structure", "page": 8, "level": 2}
    ],
    "sections": [
      {"type": "header", "content": "Executive Summary", "page": 1},
      {"type": "paragraph", "content": "This quarter...", "page": 1},
      {"type": "table", "content": "Financial Data", "page": 3}
    ],
    "page_info": [
      {"page": 1, "type": "content", "elements": 5},
      {"page": 2, "type": "content", "elements": 8}
    ]
  },
  "provider": "pymupdf4llm"
}
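
The `outline` entries carry a `level` field, which is enough to rebuild an indented table of contents. This is a small sketch under the assumption that levels start at 1, as in the sample above; `render_outline` is a hypothetical helper, not a Docsray function.

```python
def render_outline(outline, indent="  "):
    """Return the outline as indented text, one entry per line."""
    lines = []
    for entry in outline:
        # Level 1 entries get no indent; each deeper level adds one step.
        prefix = indent * (entry["level"] - 1)
        lines.append(f"{prefix}{entry['title']} (p. {entry['page']})")
    return "\n".join(lines)

# Values copied from the sample structure response
outline = [
    {"title": "Executive Summary", "page": 1, "level": 1},
    {"title": "Financial Highlights", "page": 3, "level": 1},
    {"title": "Revenue Analysis", "page": 5, "level": 2},
    {"title": "Cost Structure", "page": 8, "level": 2},
]
text = render_outline(outline)
print(text)
```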

Preview Response (`depth="preview"`)

{
  "metadata": { /* ... metadata fields ... */ },
  "structure": { /* ... structure fields ... */ },
  "preview": {
    "first_page": "Executive Summary\n\nThis quarterly report presents...",
    "sample_content": "Key highlights from the document...",
    "key_sections": [
      "Executive Summary",
      "Financial Performance", 
      "Market Analysis"
    ],
    "preview_length": 500,
    "total_length": 45000
  },
  "provider": "pymupdf4llm"
}
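
The `preview_length` and `total_length` fields make it possible to estimate how much of the document the preview covers. The sketch below assumes both fields use the same unit (the sample does not state whether they count characters or bytes); `preview_coverage` is a hypothetical helper.

```python
def preview_coverage(preview):
    """Return the fraction of the document covered by the preview (0.0 to 1.0)."""
    total = preview.get("total_length", 0)
    if total <= 0:
        # Avoid dividing by zero when the length is missing or empty
        return 0.0
    return min(preview.get("preview_length", 0) / total, 1.0)

# Values copied from the sample preview response
sample_preview = {"preview_length": 500, "total_length": 45000}
coverage = preview_coverage(sample_preview)
print(f"Preview covers {coverage:.1%} of the document")
```

A low ratio suggests the preview alone is not representative and a full extraction may be needed before drawing conclusions about the content.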

Use Cases

Document Validation

Check if a document is readable and get basic information:

def validate_document(path):
    result = docsray.peek(path, depth="metadata")
    
    if "error" in result:
        return False, result["error"]
    
    metadata = result["metadata"]
    if metadata["page_count"] == 0:
        return False, "Document has no pages"
    
    if metadata.get("is_encrypted", False):
        return False, "Document is password protected"
    
    return True, f"Valid document with {metadata['page_count']} pages"

File Browser Integration

Show document previews in file managers:

def get_document_info(file_path):
    # The "preview" key is only returned at depth="preview"
    result = docsray.peek(file_path, depth="preview")
    
    if "error" in result:
        return {"error": result["error"]}
    
    return {
        "title": result["metadata"].get("title", "Untitled"),
        "pages": result["metadata"]["page_count"],
        # format_file_size is a caller-supplied helper, not part of docsray
        "size": format_file_size(result["metadata"]["file_size"]),
        "sections": len(result["structure"]["outline"]),
        "preview": result.get("preview", {}).get("first_page", "")[:200]
    }
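
The example above calls `format_file_size`, which is not defined on this page. One possible implementation is sketched below; the choice of binary units and one decimal place is an assumption, not something Docsray prescribes.

```python
def format_file_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "GiB":
            break
        size /= 1024
    # Whole bytes need no decimals; larger units get one decimal place
    return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"

print(format_file_size(2048576))
```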

Batch Document Processing

Quickly analyze multiple documents:

import os

def analyze_document_folder(folder_path):
    # Match the extension case-insensitively so "REPORT.PDF" is included
    pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
    results = []
    
    for pdf_file in pdf_files:
        file_path = os.path.join(folder_path, pdf_file)
        result = docsray.peek(file_path, depth="metadata")
        
        if "error" not in result:
            results.append({
                "filename": pdf_file,
                "title": result["metadata"].get("title", pdf_file),
                "pages": result["metadata"]["page_count"],
                "size": result["metadata"]["file_size"]
            })
    
    return results

Document Triage

Decide which documents need detailed processing:

def triage_documents(document_list):
    for doc_path in document_list:
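        result = docsray.peek(doc_path, depth="structure")
        
        # Skip unreadable documents instead of failing on a missing key
        if "error" in result:
            print(f"{doc_path}: skipped ({result['error']})")
            continue
        
        metadata = result["metadata"]
        structure = result["structure"]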
        
        # Prioritize based on characteristics
        priority = "low"
        
        if metadata["page_count"] > 50:
            priority = "high"  # Large documents
        elif len(structure["outline"]) > 10:
            priority = "medium"  # Complex structure
        elif metadata.get("has_tables", False):
            priority = "medium"  # Contains data tables
            
        print(f"{doc_path}: {priority} priority ({metadata['page_count']} pages)")
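
The triage rules above can also be separated from the I/O so that they are easy to test without any documents. The sketch below keeps the same thresholds; `classify_priority` is a hypothetical name, and the inputs are plain dictionaries shaped like the `metadata` and `structure` parts of a peek response.

```python
def classify_priority(metadata, structure):
    """Apply the triage rules to already-fetched peek data."""
    if metadata.get("page_count", 0) > 50:
        return "high"    # Large documents
    if len(structure.get("outline", [])) > 10:
        return "medium"  # Complex structure
    if metadata.get("has_tables", False):
        return "medium"  # Contains data tables
    return "low"

# Hand-written inputs for illustration; no document is read here
print(classify_priority({"page_count": 120}, {"outline": []}))
print(classify_priority({"page_count": 25, "has_tables": True}, {"outline": []}))
print(classify_priority({"page_count": 3}, {"outline": []}))
```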

Performance Characteristics

Processing Speed by Depth

| Depth | Typical Time | Use Case |
| --- | --- | --- |
| `metadata` | 0.1-0.5s | Quick file validation |
| `structure` | 0.3-1.0s | File browser previews |
| `preview` | 0.5-2.0s | Content previews |

Provider Performance

| Provider | Speed | Best For |
| --- | --- | --- |
| PyMuPDF4LLM | Very Fast | Quick metadata and structure |
| LlamaParse | Slower | Detailed analysis when needed |
| Auto | Balanced | General use (defaults to PyMuPDF4LLM) |

Error Handling

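The peek tool reports failures in the response itself. As the example below shows, a failed call returns an `error` message together with a `type` field naming the kind of failure: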

Common Errors

result = docsray.peek("nonexistent.pdf")
if "error" in result:
    error_type = result.get("type", "Unknown")
    if error_type == "FileNotFoundError":
        print("File does not exist")
    elif error_type == "PermissionError":
        print("Cannot access file - check permissions")
    elif "encrypted" in result["error"]:
        print("Document is password protected")
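
The same checks can be collected into one function that turns an error response into a message. This is a sketch that assumes the response shape used in the example above (an `error` string plus an optional `type`); `describe_error` is a hypothetical helper, and the list of error types is not exhaustive.

```python
def describe_error(result):
    """Return a readable message for an error response, or None on success."""
    if "error" not in result:
        return None
    messages = {
        "FileNotFoundError": "File does not exist",
        "PermissionError": "Cannot access file - check permissions",
    }
    error_type = result.get("type", "Unknown")
    if error_type in messages:
        return messages[error_type]
    # Fall back to inspecting the message text, as the example above does
    if "encrypted" in str(result["error"]).lower():
        return "Document is password protected"
    return f"{error_type}: {result['error']}"

# Hand-written responses for illustration
print(describe_error({"error": "No such file", "type": "FileNotFoundError"}))
print(describe_error({"error": "Document is encrypted"}))
```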

Graceful Error Handling

def safe_peek(document_path):
    try:
        result = docsray.peek(document_path)
        
        if "error" in result:
            return None, result["error"]
        
        return result, None
        
    except Exception as e:
        return None, f"Unexpected error: {str(e)}"

# Usage
document, error = safe_peek("document.pdf")
if error:
    print(f"Cannot analyze document: {error}")
else:
    print(f"Document has {document['metadata']['page_count']} pages")

Advanced Usage

Custom Provider Selection

# Always use fast provider for peek operations
result = docsray.peek("document.pdf", provider="pymupdf4llm")

# Use AI provider for enhanced metadata extraction
result = docsray.peek("document.pdf", provider="llama-parse", depth="preview")

Conditional Processing

def smart_document_processing(doc_path):
    # Quick peek first
    peek_result = docsray.peek(doc_path, depth="metadata")
    
    if "error" in peek_result:
        return peek_result
    
    metadata = peek_result["metadata"]
    
    # Decide on further processing based on peek results
    if metadata["page_count"] > 100:
        # Large document - use fast provider
        return docsray.extract(doc_path, provider="pymupdf4llm")
    elif metadata.get("has_tables", False):
        # Has tables - use AI provider for better table extraction
        return docsray.xray(doc_path, provider="llama-parse")
    else:
        # Standard document - use default processing
        return docsray.extract(doc_path)

Caching Strategy

Peek results are automatically cached for better performance:

# First call - processes document
result1 = docsray.peek("document.pdf", depth="structure")  # ~0.5s

# Second call - returns cached result
result2 = docsray.peek("document.pdf", depth="structure")  # ~0.01s

# Different depth - may require new processing
result3 = docsray.peek("document.pdf", depth="preview")   # ~1.0s
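
The timings above are consistent with a cache keyed on both the document and the depth: repeating a call is cheap, while a new depth triggers new work. The sketch below illustrates that idea with a stand-in `slow_peek` function. It is not Docsray's cache implementation, and the key layout is an assumption.

```python
calls = []

def slow_peek(path, depth):
    """Stand-in for an expensive peek call; records each real invocation."""
    calls.append((path, depth))
    return {"metadata": {"path": path}, "depth": depth}

class PeekCache:
    """Memoize peek results per (path, depth) pair."""

    def __init__(self, peek_fn):
        self._peek_fn = peek_fn
        self._entries = {}

    def peek(self, path, depth="structure"):
        key = (path, depth)
        if key not in self._entries:
            # Cache miss: do the real work once and remember the result
            self._entries[key] = self._peek_fn(path, depth)
        return self._entries[key]

cache = PeekCache(slow_peek)
cache.peek("document.pdf", depth="structure")  # miss: calls slow_peek
cache.peek("document.pdf", depth="structure")  # hit: served from the cache
cache.peek("document.pdf", depth="preview")    # miss: different depth
print(len(calls))
```

A cache like this does not notice when a file changes on disk, so a real implementation would also need some form of invalidation.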

Best Practices

Start with Peek - Always use peek before expensive operations
Choose Appropriate Depth - Use metadata for validation, structure for overview, preview for content sampling
Handle Errors Gracefully - Always check for error responses
Cache Results - Leverage automatic caching for repeated access
Provider Selection - Use PyMuPDF4LLM for speed, LlamaParse for comprehensive analysis

Integration Examples

Web Application

import docsray  # assumed module name, matching the docsray.peek calls on this page
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/api/document/info/<path:document_path>')
def get_document_info(document_path):
    result = docsray.peek(document_path, depth="structure")
    
    if "error" in result:
        return jsonify({"error": result["error"]}), 400
    
    return jsonify({
        "title": result["metadata"].get("title", "Untitled"),
        "pages": result["metadata"]["page_count"],
        "sections": result["structure"]["outline"],
        "provider": result["provider"]
    })

Command Line Tool

import argparse
import json

import docsray  # assumed module name, matching the docsray.peek calls on this page

def main():
    parser = argparse.ArgumentParser(description='Peek at document')
    parser.add_argument('document', help='Path to document')
    parser.add_argument('--depth', choices=['metadata', 'structure', 'preview'], 
                       default='structure', help='Analysis depth')
    parser.add_argument('--json', action='store_true', help='Output as JSON')
    
    args = parser.parse_args()
    
    result = docsray.peek(args.document, depth=args.depth)
    
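    if args.json:
        print(json.dumps(result, indent=2))
        # Signal failure through the return value in JSON mode as well
        return 1 if "error" in result else 0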
    else:
        if "error" in result:
            print(f"Error: {result['error']}")
            return 1
        
        metadata = result["metadata"]
        print(f"Title: {metadata.get('title', 'Untitled')}")
        print(f"Pages: {metadata['page_count']}")
        print(f"Size: {metadata['file_size']} bytes")
        print(f"Provider: {result['provider']}")

if __name__ == "__main__":
    # Propagate main()'s return value as the process exit status
    raise SystemExit(main())

Next Steps

Learn about Map Tool for detailed document structure analysis
See Extract Tool for content extraction
Check Xray Tool for comprehensive AI analysis
Review API Reference for all parameters
