Document Search
RAG-based document Q&A: upload documents, ask questions in natural language, and get answers with source citations.
| Port | 8100 |
| Container | hai-document-search |
| Use case | Internal knowledge base, policy lookup, contract search, onboarding materials |
| Compliance | GDPR, CCPA, HIPAA (with encryption enabled), SOC 2, ISO 27001 |
How it works
Upload (PDF/DOCX/TXT/MD)
  → Text extraction
  → Intelligent chunking (1000 chars, 200 overlap)
  → Vector embedding (ChromaDB)
  → Stored for semantic search
Query ("What is our refund policy?")
  → Embed query
  → Vector similarity search (top-k chunks)
  → LLM generates answer from retrieved context
  → Response with citations
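The chunking and retrieval steps above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: `chunk_text`, `cosine`, and `top_k` are hypothetical names, the fixed character window ignores whatever sentence or paragraph boundaries the "intelligent" chunker may respect (so chunk counts will not necessarily match the `chunk_count` the API reports), and cosine similarity is an assumed metric.

```python
import math


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, each window starts 800 chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

In the real pipeline the vectors come from an embedding model and the ranking is delegated to ChromaDB; the sketch only shows why overlapping windows keep a sentence that straddles a boundary retrievable from either side.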
API Endpoints
Upload a document
curl -X POST http://localhost:8100/api/v1/documents/upload \
  -H "X-API-Key: hai_your_key" \
  -F "file=@company-handbook.pdf"
Response:
{
  "document_id": "a1b2c3d4e5f6",
  "filename": "company-handbook.pdf",
  "size_bytes": 245760,
  "chunk_count": 47,
  "text_length": 52340,
  "status": "indexed"
}
Supported formats: PDF, DOCX, TXT, MD, CSV
Max file size: Configurable (default 50MB)
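A client can reject obviously bad uploads before sending them. The sketch below is a hypothetical client-side pre-check (`check_upload` is not part of the API); it assumes the format list and the 50 MB default stated above, and it assumes "MB" means 1024 × 1024 bytes. The server remains the authority, since its limit is configurable.

```python
from pathlib import Path

# Formats and default limit as listed in this section; both are assumptions
# on the client side because the server-side limit can be reconfigured.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".csv"}
DEFAULT_MAX_UPLOAD_MB = 50


def check_upload(filename: str, size_bytes: int, max_mb: int = DEFAULT_MAX_UPLOAD_MB) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()  # case-insensitive extension check
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > max_mb * 1024 * 1024:
        problems.append(f"file exceeds {max_mb} MB")
    return problems
```

For example, `check_upload("company-handbook.pdf", 245760)` returns an empty list, so the file from the upload example would pass.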
Search documents
curl -X POST http://localhost:8100/api/v1/search/query \
  -H "X-API-Key: hai_your_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the company refund policy?", "top_k": 5}'
Response:
{
  "answer": "According to the Company Handbook, the refund policy allows full refunds within 30 days of purchase. After 30 days, a 15% restocking fee applies. Digital products are non-refundable after download. [Source: company-handbook.pdf]",
  "confidence": 0.847,
  "sources": [
    {"filename": "company-handbook.pdf", "document_id": "a1b2c3d4e5f6", "relevance_score": 0.892}
  ],
  "query": "What is the company refund policy?",
  "model": "claude-sonnet-4-20250514",
  "latency_ms": 1243
}
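The same search call can be made from Python with only the standard library. This is a sketch under assumptions: `build_search_request` and `summarize` are hypothetical helper names, the base URL is the local default used in the curl example, and the response is assumed to have the shape shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # port taken from the curl examples in this section


def build_search_request(query: str, api_key: str, top_k: int = 5,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/search/query",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> str:
    """Render the answer followed by one line per cited source."""
    lines = [response["answer"]]
    for src in response.get("sources", []):
        lines.append(f"- {src['filename']} (relevance {src['relevance_score']:.2f})")
    return "\n".join(lines)
```

To send the request, pass it to `urllib.request.urlopen` and decode the body, for instance `with urllib.request.urlopen(req) as resp: print(summarize(json.load(resp)))`. Keeping request construction separate from sending makes the helper easy to test without a running container.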
Delete a document
View statistics
Admin Endpoints
Requires Admin role.
| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/admin/audit` | GET | View audit log (filterable by user, action) |
| `/api/v1/admin/gdpr/export/{user_id}` | GET | Export all data for a user (GDPR Article 15) |
| `/api/v1/admin/gdpr/delete/{user_id}` | DELETE | Delete all data for a user (GDPR Article 17) |
| `/api/v1/admin/storage` | GET | View storage usage |
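The two GDPR endpoints differ only in path segment and HTTP method, which a small helper can capture. The sketch below is illustrative: `gdpr_request` is a hypothetical name, and it assumes the admin endpoints accept the same `X-API-Key` header as the upload and search endpoints, with a key that carries the Admin role.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8100"  # assumed local default, as in the curl examples


def gdpr_request(action: str, user_id: str, api_key: str,
                 base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the export (GET) or delete (DELETE) request for one user."""
    methods = {"export": "GET", "delete": "DELETE"}
    if action not in methods:
        raise ValueError("action must be 'export' or 'delete'")
    # Percent-encode the user id so characters such as '/' or '@' stay inside one path segment.
    url = f"{base_url}/api/v1/admin/gdpr/{action}/{quote(user_id, safe='')}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method=methods[action])
```

Because deletion is irreversible, a caller would typically run the export request first and archive the result before issuing the delete.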
Configuration
| Variable | Default | Description |
|---|---|---|
| `HAAGSMAN_CHROMA_PATH` | `/data/chroma` | Vector database storage path |
| `HAAGSMAN_MAX_UPLOAD_SIZE_MB` | `50` | Maximum upload file size (MB) |
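As an illustration of how these variables and their defaults might be consumed, the sketch below reads them from the environment. `Settings` and `load_settings` are hypothetical names, not the service's own code, and the conversion assumes 1 MB = 1024 × 1024 bytes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    chroma_path: str
    max_upload_bytes: int


def load_settings(env=None) -> Settings:
    """Read the two documented variables, falling back to their documented defaults."""
    env = os.environ if env is None else env
    max_mb = int(env.get("HAAGSMAN_MAX_UPLOAD_SIZE_MB", "50"))
    if max_mb <= 0:
        raise ValueError("HAAGSMAN_MAX_UPLOAD_SIZE_MB must be positive")
    return Settings(
        chroma_path=env.get("HAAGSMAN_CHROMA_PATH", "/data/chroma"),
        max_upload_bytes=max_mb * 1024 * 1024,
    )
```

Passing a plain dictionary as `env` makes the defaults easy to check without touching the real process environment.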
HIPAA Considerations
If you use Document Search with healthcare data (patient records, clinical notes), follow the requirements below.
HIPAA Requirements
- Ensure `ENCRYPTION_KEY` is set (AES-256 encryption at rest)
- Keep audit logging enabled (it is on by default)
- Use RBAC to restrict access to authorized personnel only
- Deploy on HIPAA-eligible infrastructure (BAA with cloud provider)
- Do not send PHI to cloud LLM providers; use Ollama for air-gapped deployment
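Of the checklist items above, only the encryption key can be verified from the environment using names this page documents. A deployment script might run a preflight check along these lines; `hipaa_preflight` is a hypothetical helper, and the remaining items (RBAC, BAA, LLM provider choice) still need manual review.

```python
import os


def hipaa_preflight(env=None) -> list[str]:
    """Return warnings for checklist items that can be verified from the environment."""
    env = os.environ if env is None else env
    warnings = []
    # An empty or whitespace-only value is treated the same as a missing key.
    if not env.get("ENCRYPTION_KEY", "").strip():
        warnings.append("ENCRYPTION_KEY is not set")
    return warnings
```

An empty result does not demonstrate compliance; it only means this one automated check passed.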
Performance
| Metric | Typical Value |
|---|---|
| Document indexing | 2-5 seconds per page |
| Search query (with LLM) | 1-3 seconds |
| Max documents | Limited by disk space |
| Concurrent users | 50+ (with 2 workers) |