The Enterprise Playbook
How vision-language models move from research to enterprise production. Architecture choices, evaluation, cost.
Download PDF
RESOURCES · FIELD NOTES
FAQs, whitepapers, and writing from our engineering team on what actually moves an AI initiative from notebook to production.
Fixed scope, fixed timeline, fixed fee per package. No T&M surprises.
You do. Code, fine-tuned weights, and trained models are yours.
Yes — on-prem, your VPC, or managed SaaS. Your governance, your call.
GDPR-compliant by default; EU AI Act conformity; India DPDP; HIPAA where the engagement requires.
That's exactly what the Discovery Sprint decides. We're agnostic.
4–8 weeks for most use cases, depending on data readiness.
Both. Our fixed-fee model works for either.
Every production system ships with evaluation harnesses, guardrails, and human-in-the-loop escalation paths.
Yes — we run proprietary vernacular OCR and NLU across 22 Indian languages.
ML architect + ML engineers + MLOps + product + QA. Sized to scope.
Yes — the Managed AI Operations package covers monitoring, retraining, and roadmap.
We've shipped AI to 50M+ users. Most “AI consultancies” have shipped AI to slide decks.
WHITEPAPERS
Each whitepaper is a self-contained POV on a hard problem — written by the engineers who run that problem in production.
How vision-language models move from research to enterprise production. Architecture choices, evaluation, cost.
Download PDF
Script-specific detection, Indic NLU, and the data engineering behind population-scale vernacular AI.
Download PDF
The operating-model failures that kill AI initiatives, and the engagement structure that gets past them.
Download PDF
The three-axis tradeoff that decides CV system design, and where consumer-scale deployment changes the math.
Download PDF
FIELD NOTES
FEATURED · POST-TRAINING SERIES
FIELD NOTES · LLM TRAINING · 12 MIN READ
Most teams asking us "should we fine-tune or use RLHF" are asking the wrong question. The right one is: at what cost, at what scale, and against what failure mode?
Here's what we've seen across roughly forty production engagements over the last eighteen months.
LoRA works when your domain shift is moderate, your data volume is in the thousands-to-low-tens-of-thousands of examples, and your latency budget doesn't require collapsing the adapter into the base weights. At our scale of consumer image GenAI, LoRA is the workhorse — brand adapters for Photoleaf, style adapters for LightX, vertical adapters for enterprise creative engines. Cost: typically <$2K per adapter on A100s. Quality ceiling: high for style and surface behavior, low for genuinely new capabilities.
When LoRA falls down: when the domain shift is large (e.g. radiology DICOMs vs natural images), when you need new tokenization (Indic scripts, code), or when the behavior you want isn't representable in your data — only in your preferences.
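For readers unfamiliar with what "collapsing the adapter into the base weights" means, here is a minimal sketch in plain Python. The matrices, the rank, and the `alpha` scaling value are made-up illustrative assumptions, not taken from any adapter mentioned above. It shows that the unmerged path (base layer plus a low-rank side path) and the merged path (a single folded weight matrix) produce the same output.

```python
# Sketch of a LoRA adapter on one linear layer, in plain Python.
# All shapes and values are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Frozen base weight W (2 x 3) and a rank-1 adapter: B (2 x 1), A (1 x 3).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[0.5], [-0.25]]
A = [[1.0, 2.0, 0.0]]
alpha, r = 2.0, 1  # LoRA scaling hyperparameters (assumed values)

x = [[1.0], [1.0], [1.0]]  # one input column vector

# Unmerged: base path plus a separate low-rank path, B @ (A @ x), on every call.
low_rank = scale(matmul(B, matmul(A, x)), alpha / r)
y_unmerged = add(matmul(W, x), low_rank)

# Merged: fold the adapter into the base weight once, then serve a single matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)
```

Keeping the adapter unmerged costs the extra low-rank multiplications on every call but lets many adapters share one base model. Merging removes that overhead at the price of a dedicated weight copy per adapter.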
Full SFT is the right move when LoRA's representational capacity is the bottleneck — typically when you have >100K high-quality examples and a domain shift that LoRA can't bridge. Cost scales with parameter count: a 7B SFT run is usually $15–40K in compute; a 70B run is $200K+. Quality ceiling: significantly higher than LoRA on domain-specific accuracy, but it does nothing for behavior.
When full SFT falls down: when the failure mode isn't "wrong answer" but "right answer, wrong tone" or "right answer, unsafe phrasing." SFT can teach a model what to say. It cannot reliably teach it what not to say.
Post-training alignment is where most teams under-invest. The shift from "the model knows the domain" to "the model behaves correctly in deployment" almost always requires it. We've shipped RLHF pipelines using PPO and reward models for two BFSI clients; DPO for three media and creative clients (faster to iterate, no separate reward model); and ORPO for one healthcare client where the preference data was scarce and we needed to combine SFT and preference optimization in a single pass.
Cost: surprisingly often dominated by preference data collection, not compute. Plan for $20–80K in human annotation for a serious RLHF pipeline. Compute itself is usually <$30K for a 7B model.
When RLHF falls down: when you don't actually have a behavioral failure mode — you have an accuracy failure mode. RLHF won't fix "the model doesn't know our product catalog." SFT will.
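To make the "no separate reward model" point about DPO concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes you already have summed sequence log-probabilities from the policy being trained and from a frozen reference model. The function name, argument names, and the `beta` default are illustrative assumptions, not the pipeline described above.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The preference signal enters only through the two log-probability margins, so the only extra model in the loop is the frozen reference. A policy identical to the reference scores `log(2)`; the loss drops as the policy shifts probability toward the chosen response.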
The common mistake: treating these as alternatives. In production, the right answer is usually all three in sequence — LoRA or SFT first, then alignment, then compression. The packages we sell are shaped around exactly that sequence.
— Sharad Shankar, Founder
FIELD NOTES · TRAINING OBSERVABILITY · 9 MIN READ
The expensive failures don't happen at the start of a training run. They happen forty hours in, on a Saturday, when no one's watching the dashboard. Loss is still going down. Token efficiency looks normal. Then somewhere between checkpoint 18 and checkpoint 19, gradient norm spikes by 12x, and by the time anyone notices on Monday, you've written eighteen hours of garbage to disk and the next four checkpoints are unrecoverable.
This is the failure mode we built our mid-training observability practice to catch. Here's what we instrument and what we look at.
Loss is the dashboard everyone has. It's also the dashboard that will lie to you for the longest. By the time loss starts looking wrong, the divergence has been building for hours. The leading indicators we actually watch:
When gradient norm spikes:
When token efficiency drops:
When hardware utilization tanks:
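The gradient-norm indicator is the easiest of these to check mechanically. Below is a minimal sketch, assuming a rolling-median baseline. The class name, window size, warmup length, and 5x threshold are illustrative assumptions, not values from the practice described here.

```python
from collections import deque
from statistics import median

class GradNormSpikeDetector:
    """Flags gradient norms that jump far above a rolling median baseline."""

    def __init__(self, window=200, ratio=5.0, warmup=20):
        self.history = deque(maxlen=window)  # recent non-spike norms
        self.ratio = ratio                   # spike threshold as a multiple of the median
        self.warmup = warmup                 # steps to observe before flagging anything

    def update(self, grad_norm):
        """Record one step's gradient norm; return True if it looks like a spike."""
        is_spike = False
        if len(self.history) >= self.warmup:
            is_spike = grad_norm > self.ratio * median(self.history)
        if not is_spike:
            # Keep spikes out of the baseline so a diverging run
            # cannot gradually redefine what "normal" looks like.
            self.history.append(grad_norm)
        return is_spike
```

In a training loop, `update` would be called once per optimizer step with the global gradient norm, and a `True` result would page someone or pause the run rather than wait for the loss curve to show the problem.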
The most expensive mistake in long training runs isn't the divergence itself. It's not having a clean rollback target. Every team we've worked with that lost meaningful compute did so because their checkpointing strategy was "save every N steps" without retention policy — and by the time they noticed the run was diverging, the last clean checkpoint had been overwritten.
Save more checkpoints than you think you need. Tier them — keep every 5th checkpoint indefinitely, every checkpoint for the last 24 hours. Storage is cheap. Re-running pretraining is not.
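The tiering rule above (every 5th checkpoint kept indefinitely, every checkpoint kept for the last 24 hours) can be sketched as a small retention filter. The `Checkpoint` type and the function name are hypothetical, and a real setup would act on storage paths rather than in-memory records.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    index: int       # sequential checkpoint number
    saved_at: float  # Unix timestamp in seconds

def checkpoints_to_delete(checkpoints, now, keep_every=5, recent_window_s=24 * 3600):
    """Return the checkpoints that fall outside both retention tiers."""
    doomed = []
    for ckpt in checkpoints:
        is_milestone = ckpt.index % keep_every == 0        # kept indefinitely
        is_recent = now - ckpt.saved_at <= recent_window_s  # kept for the window
        if not (is_milestone or is_recent):
            doomed.append(ckpt)
    return doomed
```

Expressing the policy as "what may be deleted" rather than "what to keep" makes the safe default explicit: anything the rule does not name stays on disk.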
— AndOr training infrastructure team
FIELD NOTES · INFERENCE OPTIMIZATION · 11 MIN READ
A capable model is not a deployable model. The gap between "passing eval" and "running profitably at scale" is closed by post-training compression — and most teams get this wrong by trying to do it all at once, or by treating it as an afterthought.
Here's what we've learned shipping compressed models to consumer-scale LightX traffic and to two BFSI clients running models in their VPCs.
Distillation is the highest-impact lever, by a wide margin. A well-distilled student model at 1/4 the parameter count of its teacher will often retain 92–97% of the teacher's domain accuracy at 4x the throughput. We use distillation in nearly every production engagement.
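As a point of reference, the classic distillation objective trains the student to match the teacher's temperature-softened output distribution. The sketch below implements that textbook loss for a single example. The text does not say which objective is used in these engagements, so treat the function names and the temperature default as assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by the given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

In practice this term is usually mixed with the ordinary cross-entropy loss on ground-truth labels, with the mixing weight tuned per task.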
Two failure modes we see constantly:
INT8 is essentially free for most modern architectures — do it. INT4 is where things get interesting. Quantization-aware training (QAT) almost always beats post-training quantization (PTQ) at INT4, but costs an additional fine-tuning pass.
When PTQ is enough: short-context inference, models where you can tolerate 1–2% accuracy loss, deployments where the cost saving matters more than the marginal quality.
When QAT is worth it: latency-critical deployments, models in regulated domains where accuracy regressions need to be defensible, long-context workloads where PTQ quality degrades nonlinearly.
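To see why INT4 is harder than INT8, it helps to look at the simplest form of PTQ: symmetric, per-tensor, round-to-nearest quantization. The sketch below quantizes a handful of made-up weights at both precisions and compares the reconstruction error. The weight values and helper names are illustrative assumptions, and production PTQ schemes (per-channel scales, calibration data) are more sophisticated than this.

```python
def quantize_dequantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax  # size of one integer step in float units
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # clamped integer code
        out.append(q * scale)
    return out

def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

weights = [0.013, -0.402, 0.250, 0.071, -0.118, 0.333, -0.009, 0.190]
err_int8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
err_int4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
```

With only 15 usable levels at INT4 versus 255 at INT8, each weight lands much further from its original value. That rounding error is what QAT lets the model adapt to during the extra fine-tuning pass.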
Speculative decoding is the cheapest latency win in the stack if your traffic pattern fits. We've seen 2–3x throughput improvements on autoregressive workloads with no accuracy loss. The draft model can often be a version of your production model at half the size, or even a quantized copy of it.
When it doesn't help: very short outputs (the speculation overhead dominates), batch sizes >32 (you're already throughput-bound, not latency-bound), and workloads with high output entropy (acceptance rates collapse).
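The sensitivity to acceptance rate can be quantified with a standard back-of-envelope model. Assuming each drafted token is accepted independently with the same probability, the expected number of tokens emitted per target-model pass has a closed form. The sketch below ignores the draft model's own cost, so it gives an upper bound on the gain and does not predict measured speedup. The function name is hypothetical.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model forward pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability acceptance_rate, and that the target model always
    contributes one token of its own at the first rejection (or at the end).
    """
    a = acceptance_rate
    if a >= 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)
```

With a draft length of 4, an acceptance rate of 0.8 yields about 3.4 tokens per target pass, while 0.3 yields about 1.4. At that point the drafting overhead can erase the gain, which is the high-entropy collapse described above.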
We almost always run these in this sequence:
The order matters. Quantizing before alignment is a common and expensive mistake — alignment doesn't transfer cleanly across precisions. Same for distilling before alignment.
Compression isn't about per-token cost. It's about end-to-end serving cost — including the engineering hours to maintain the compressed pipeline, the eval harness needed to detect quality regressions, and the rollback infrastructure for when compression turns out to have broken something subtle.
Budget for the operational tail, not just the compute saving.
— AndOr inference engineering team
MORE NOTES
MAR 2026
How we shrunk a foundation segmentation model into a 7MB student that runs in 80ms on mid-range Androids.
Read note →
FEB 2026
An opinionated frame for when LoRA is enough — and the data-volume thresholds where you should pay for full SFT.
Read note →
FEB 2026
A tour of why generic OCR underperforms on Devanagari and Bengali — and the architecture choices that fix it.
Read note →
JAN 2026
Why hallucination, refusal, and modality leakage need separate evals — and how we structure them in production.
Read note →
DEC 2025
Most RAG failures are retrieval failures, chunking failures, or eval failures — not LLM failures. A field debug guide.
Read note →
DEC 2025
The scheduler tricks, model cascades, and quantization choices that let us serve 5M designs/month profitably.
Read note →
NEXT STEP