
Document Search

RAG-based document Q&A: upload documents, ask questions in natural language, and get answers with source citations.

Port: 8100
Container: hai-document-search
Use case: Internal knowledge base, policy lookup, contract search, onboarding materials
Compliance: GDPR, CCPA, HIPAA (with encryption enabled), SOC 2, ISO 27001

How it works

Upload (PDF/DOCX/TXT/MD/CSV)
    → Text extraction
    → Intelligent chunking (1000-character chunks, 200-character overlap)
    → Vector embedding
    → Stored in ChromaDB for semantic search
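
The chunking step above can be sketched as a sliding window over the extracted text. This is only an illustration, assuming plain character-based windows with the 1000/200 figures from the pipeline. The function name `chunk_text` is hypothetical, and the service's own chunker may well split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: it illustrates 1000-character chunks with a
    200-character overlap, not the service's actual implementation.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # with the defaults, each window starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.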

Query ("What is our refund policy?")
    → Embed query
    → Vector similarity search (top-k chunks)
    → LLM generates answer from retrieved context
    → Response with citations
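
The retrieval and prompt-building steps of the query flow can be sketched in a few lines. In the service, ChromaDB performs the similarity search; the version below is an in-memory stand-in that assumes cosine similarity (the metric actually configured may differ), and the names `cosine_similarity`, `top_k_chunks`, and `build_prompt` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, indexed, k=5):
    """Return the k highest-scoring (score, chunk_text) pairs.

    `indexed` is a list of (chunk_text, vector) pairs, standing in for
    what the vector store would hold.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, retrieved) -> str:
    """Assemble an LLM prompt from the retrieved chunks (illustrative wording)."""
    context = "\n\n".join(text for _, text in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The `k` argument corresponds to the `top_k` field in the search request: a larger value gives the LLM more context at the cost of a longer prompt.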

API Endpoints

Upload a document

curl -X POST http://localhost:8100/api/v1/documents/upload \
  -H "X-API-Key: hai_your_key" \
  -F "file=@company-handbook.pdf"

Response:

{
  "document_id": "a1b2c3d4e5f6",
  "filename": "company-handbook.pdf",
  "size_bytes": 245760,
  "chunk_count": 47,
  "text_length": 52340,
  "status": "indexed"
}

Supported formats: PDF, DOCX, TXT, MD, CSV

Max file size: 50 MB by default, configurable via HAAGSMAN_MAX_UPLOAD_SIZE_MB
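
A client can check a file against these limits before uploading. The sketch below is a hypothetical pre-flight helper (`check_upload` is not part of the service); it assumes formats are recognized by file extension and that 1 MB means 1024 × 1024 bytes, neither of which the page confirms.

```python
from pathlib import Path

# Extensions for the formats listed above (assumed mapping).
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".csv"}
DEFAULT_MAX_UPLOAD_MB = 50


def check_upload(filename: str, size_bytes: int,
                 max_mb: int = DEFAULT_MAX_UPLOAD_MB) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > max_mb * 1024 * 1024:  # assumption: binary megabytes
        problems.append(f"file exceeds {max_mb} MB")
    return problems
```

The server remains the authority on what it accepts; a check like this only saves a round trip for obviously invalid files.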

Search documents

curl -X POST http://localhost:8100/api/v1/search/query \
  -H "X-API-Key: hai_your_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the company refund policy?", "top_k": 5}'

Response:

{
  "answer": "According to the Company Handbook, the refund policy allows full refunds within 30 days of purchase. After 30 days, a 15% restocking fee applies. Digital products are non-refundable after download. [Source: company-handbook.pdf]",
  "confidence": 0.847,
  "sources": [
    {"filename": "company-handbook.pdf", "document_id": "a1b2c3d4e5f6", "relevance_score": 0.892}
  ],
  "query": "What is the company refund policy?",
  "model": "claude-sonnet-4-20250514",
  "latency_ms": 1243
}
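
A caller will usually want to turn this response into something displayable. The following sketch reads the `answer`, `confidence`, and `sources` fields shown above; the helper name `summarize_answer` and the 0.5 confidence threshold are assumptions chosen for illustration, not values defined by the service.

```python
import json


def summarize_answer(response_body: str, min_confidence: float = 0.5) -> str:
    """Format a search response as the answer text plus a source list.

    Answers below `min_confidence` (an arbitrary, illustrative threshold)
    are flagged rather than hidden.
    """
    data = json.loads(response_body)
    cited = ", ".join(
        f"{s['filename']} ({s['relevance_score']:.2f})"
        for s in data.get("sources", [])
    )
    flag = "" if data.get("confidence", 0.0) >= min_confidence else " [low confidence]"
    return f"{data['answer']}{flag}\nSources: {cited or 'none'}"
```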

Delete a document

curl -X DELETE http://localhost:8100/api/v1/documents/a1b2c3d4e5f6 \
  -H "X-API-Key: hai_your_key"

View statistics

curl http://localhost:8100/api/v1/documents/stats \
  -H "X-API-Key: hai_your_key"

Admin Endpoints

Requires Admin role.

Endpoint                              Method  Description
/api/v1/admin/audit                   GET     View audit log (filterable by user, action)
/api/v1/admin/gdpr/export/{user_id}   GET     Export all data for a user (GDPR Article 15)
/api/v1/admin/gdpr/delete/{user_id}   DELETE  Delete all data for a user (GDPR Article 17)
/api/v1/admin/storage                 GET     View storage usage
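
As an illustration of the two GDPR endpoints, the sketch below builds (but does not send) the corresponding HTTP requests. It assumes the admin endpoints authenticate with the same `X-API-Key` header as the document endpoints, which this page does not state explicitly; `gdpr_request` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8100"  # port taken from the overview above


def gdpr_request(action: str, user_id: str, api_key: str) -> Request:
    """Build a request for the GDPR export (GET) or delete (DELETE) endpoint."""
    methods = {"export": "GET", "delete": "DELETE"}
    if action not in methods:
        raise ValueError("action must be 'export' or 'delete'")
    # Percent-encode the user ID so characters like '/' cannot alter the path.
    url = f"{BASE_URL}/api/v1/admin/gdpr/{action}/{quote(user_id, safe='')}"
    # Assumption: admin routes accept the same X-API-Key header.
    return Request(url, method=methods[action], headers={"X-API-Key": api_key})
```

Passing the returned object to `urllib.request.urlopen` would send it. Because a delete is irreversible, building the request separately makes it easy to log or confirm the target URL first.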

Configuration

Variable                      Default       Description
HAAGSMAN_CHROMA_PATH          /data/chroma  Vector database storage path
HAAGSMAN_MAX_UPLOAD_SIZE_MB   50            Maximum upload file size, in MB
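
To make the defaults concrete, here is a minimal sketch of how these two variables could be read from the environment. It is not the service's own loader; `load_settings` and the returned key names are hypothetical.

```python
import os


def load_settings(env=None) -> dict:
    """Read the two documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    max_mb = int(env.get("HAAGSMAN_MAX_UPLOAD_SIZE_MB", "50"))
    if max_mb <= 0:
        raise ValueError("HAAGSMAN_MAX_UPLOAD_SIZE_MB must be a positive integer")
    return {
        "chroma_path": env.get("HAAGSMAN_CHROMA_PATH", "/data/chroma"),
        "max_upload_size_mb": max_mb,
    }
```

Accepting an explicit mapping instead of always reading `os.environ` keeps the function easy to test with different values.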

HIPAA Considerations

If using Document Search with healthcare data (patient records, clinical notes):

HIPAA Requirements

  1. Set ENCRYPTION_KEY to enable AES-256 encryption at rest
  2. Keep audit logging on (it is enabled by default)
  3. Use RBAC to restrict access to authorized personnel only
  4. Deploy on HIPAA-eligible infrastructure (with a BAA from the cloud provider)
  5. Do not send PHI to cloud LLM providers; use a local model via Ollama for air-gapped deployments
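
Two of the requirements above lend themselves to an automated check: that `ENCRYPTION_KEY` is set, and that answers come from a local model. The sketch below is a hypothetical deployment check, not a feature of the service. It assumes the `model` field of a search response identifies the LLM in use, and that the operator supplies an allowlist of local model names (the name `llama3` in the usage example is only a placeholder).

```python
def hipaa_preflight(env: dict, response_model: str, local_models: set[str]) -> list[str]:
    """Return findings for the two checkable requirements; empty means both pass.

    Audit logging, RBAC, and the BAA must still be verified by other means.
    """
    findings = []
    if not env.get("ENCRYPTION_KEY"):
        findings.append("ENCRYPTION_KEY is not set")
    # Assumption: the "model" field in search responses names the LLM in use.
    if response_model not in local_models:
        findings.append(f"model {response_model!r} is not in the local allowlist")
    return findings
```

For example, `hipaa_preflight(os.environ, response["model"], {"llama3"})` would report a finding if a test query were answered by a cloud-hosted model.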

Performance

Metric                    Typical value
Document indexing         2-5 seconds per page
Search query (with LLM)   1-3 seconds
Max documents             Limited by disk space
Concurrent users          50+ (with 2 workers)