Scaling & Optimization Tips
When operating at scale, cost and latency become primary concerns. The recommendations below help optimize resource use and maintain throughput.
- Use visual compression first: Compress and tokenize images before inference to reduce downstream processing and model token usage (compression sketch after this list).
- Prefer small models for common cases: Let small, efficient LLMs handle ~95% of routine parsing and mapping tasks (see the routing sketch below).
- Batch non-urgent jobs: Schedule low-priority or large-batch conversions overnight, when compute demand is lower (queueing sketch below).
- Cache FHIR templates: Precompile and cache common FHIR templates per document type instead of regenerating them for every job (caching sketch below).
- Escalate only on low confidence: Use model-confidence thresholds to route ambiguous cases to manual review or a larger model; this pairs with the small-model tip in the routing sketch below.
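A minimal sketch of the compression step, assuming Pillow is used for image handling; the 1024 px cap and JPEG quality are illustrative values, not settings from this document:

```python
from io import BytesIO

from PIL import Image

MAX_DIM = 1024     # assumed cap; tune to the vision model's input resolution
JPEG_QUALITY = 80  # assumed quality/size trade-off

def compress_scan(raw: bytes) -> bytes:
    """Downscale and re-encode a scanned page before it reaches the model."""
    img = Image.open(BytesIO(raw))
    img.thumbnail((MAX_DIM, MAX_DIM))  # shrinks in place, keeps aspect ratio
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=JPEG_QUALITY)
    return buf.getvalue()
```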
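The small-model-first and low-confidence-escalation tips combine naturally into one routing function. In this sketch, `small_model`, `large_model`, and the 0.85 threshold are hypothetical stand-ins to be calibrated against review outcomes:

```python
from typing import Callable

Model = Callable[[str], tuple[dict, float]]  # returns (parsed result, confidence)

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff; calibrate on review outcomes

def route(document: str, small_model: Model, large_model: Model) -> dict:
    """Try the cheap model first; escalate only when confidence is low."""
    result, confidence = small_model(document)
    if confidence >= CONFIDENCE_THRESHOLD:
        return result  # the ~95% routine path stays on the small model
    result, confidence = large_model(document)
    if confidence < CONFIDENCE_THRESHOLD:
        result["needs_manual_review"] = True  # still ambiguous: flag for a human
    return result
```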
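One way to defer non-urgent work is a priority queue that releases bulk jobs only during an off-peak window; the 22:00-06:00 window and the two priority levels here are assumptions, not part of this design:

```python
import heapq
from datetime import datetime, time

URGENT, BULK = 0, 1
OFF_PEAK_START, OFF_PEAK_END = time(22, 0), time(6, 0)  # assumed overnight window

def is_off_peak(now: datetime) -> bool:
    t = now.time()
    return t >= OFF_PEAK_START or t < OFF_PEAK_END  # window wraps past midnight

class JobQueue:
    """Releases urgent jobs immediately; holds bulk jobs for the off-peak window."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = 0  # FIFO tie-breaker within a priority level

    def submit(self, job, priority: int = BULK) -> None:
        heapq.heappush(self._heap, (priority, self._seq, job))
        self._seq += 1

    def next_job(self, now: datetime | None = None):
        now = now or datetime.now()
        if self._heap:
            priority, _, job = self._heap[0]
            if priority == URGENT or is_off_peak(now):
                heapq.heappop(self._heap)
                return job
        return None  # bulk work waits until the window opens
```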
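For template caching, a per-document-type memo such as `functools.lru_cache` is often enough. In this sketch, `build_template` and the returned skeletons are hypothetical placeholders for the real compilation step:

```python
import copy
from functools import lru_cache

def build_template(document_type: str) -> dict:
    """Hypothetical, expensive step: compile the FHIR skeleton for a type."""
    if document_type == "lab_report":
        return {"resourceType": "DiagnosticReport", "status": "final"}
    return {"resourceType": "DocumentReference", "status": "current"}

@lru_cache(maxsize=128)  # one entry per document type; hits skip compilation
def _cached_template(document_type: str) -> dict:
    return build_template(document_type)

def fhir_template(document_type: str) -> dict:
    # Deep-copy so callers can fill in values without mutating the cached copy.
    return copy.deepcopy(_cached_template(document_type))
```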
Operational Metrics to Track
- Job queue length and average wait time
- OCR success/failure rates and confidence distributions
- Average processing time per document by document type
- GPU utilization for the Vision Processor and cost per document
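A sketch of how the application-level metrics above might be instrumented, assuming prometheus_client; the metric names, labels, and buckets are illustrative rather than an established schema. GPU utilization typically comes from a dedicated exporter rather than application code.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

QUEUE_LENGTH = Gauge("job_queue_length", "Jobs currently waiting in the queue")
OCR_RESULTS = Counter("ocr_results_total", "OCR outcomes by status", ["status"])
OCR_CONFIDENCE = Histogram("ocr_confidence", "OCR confidence distribution",
                           buckets=[0.5, 0.7, 0.8, 0.9, 0.95, 1.0])
PROCESSING_SECONDS = Histogram("doc_processing_seconds",
                               "Processing time per document", ["doc_type"])

def record_document(doc_type: str, ocr_ok: bool, confidence: float,
                    seconds: float) -> None:
    """Record one finished document against the counters and histograms."""
    OCR_RESULTS.labels(status="success" if ocr_ok else "failure").inc()
    OCR_CONFIDENCE.observe(confidence)
    PROCESSING_SECONDS.labels(doc_type=doc_type).observe(seconds)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for the scraper
```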