RESOURCES · FIELD NOTES

Field notes from shipping AI at scale.

FAQs, whitepapers, and writing from our engineering team on what actually moves an AI initiative from notebook to production.

FAQ

Twelve questions we hear most often.

If yours isn't here, write to hello@andor.in.

01 How do you price engagements?

Fixed scope, fixed timeline, fixed fee per package. No time-and-materials (T&M) surprises.

02 Who owns the IP we build together?

You do. Code, fine-tuned weights, and trained models are yours.

03 Can you deploy on our infrastructure?

Yes — on-prem, your VPC, or managed SaaS. Your governance, your call.

04 How do you handle data privacy and compliance?

GDPR-compliant by default, with EU AI Act conformity, India DPDP compliance, and HIPAA where the engagement requires it.

05 What if we need fine-tuning vs RAG vs an off-the-shelf API?

That's exactly what the Discovery Sprint decides. We're agnostic.

06 How quickly can you get a PoC running?

4–8 weeks for most use cases, depending on data readiness.

07 Do you work with startups, or only large enterprises?

Both. Our fixed-fee model works for either.

08 What about model hallucinations and safety?

Every production system ships with evaluation harnesses, guardrails, and human-in-the-loop escalation paths.

09 Can you handle Indian languages and scripts?

Yes — we run proprietary vernacular OCR and NLU across 22 Indian languages.

10 What's your typical team composition?

ML architect + ML engineers + MLOps + product + QA. Sized to scope.

11 Do you provide ongoing support after launch?

Yes — the Managed AI Operations package covers monitoring, retraining, and roadmap.

12 How is AndOr different from other AI consultancies?

We've shipped AI to 50M+ users. Most “AI consultancies” have shipped AI to slide decks.

WHITEPAPERS

Long-form arguments worth your time.

Each whitepaper is a self-contained POV on a hard problem — written by the engineers who run that problem in production.

WP-01 · 2026
VLMs in 2026: The Enterprise Playbook

How vision-language models move from research to enterprise production. Architecture choices, evaluation, cost.

Download PDF

WP-02 · 2026
Vernacular AI: Building NLU for India's 22 Languages

Script-specific detection, Indic NLU, and the data engineering behind population-scale vernacular AI.

Download PDF

WP-03 · 2026
From PoC to Production: Why 80% of Enterprise AI Stalls

The operating-model failures that kill AI initiatives — and the engagement structure that gets past them.

Download PDF

WP-04 · 2026
Computer Vision Economics: Cost, Latency, Accuracy Trade-offs

The three-axis trade-off that decides CV system design, and where consumer-scale deployment changes the math.

Download PDF

FIELD NOTES

Short writing from our engineering team.

FEATURED · POST-TRAINING SERIES

FIELD NOTES · LLM TRAINING · 12 MIN READ

LoRA vs Full SFT vs RLHF: a production cost-and-quality comparison

Most teams asking us "should we fine-tune or use RLHF" are asking the wrong question. The right one is: at what cost, at what scale, and against what failure mode?

Here's what we've seen across roughly forty production engagements over the last eighteen months.

LoRA and QLoRA

LoRA works when your domain shift is moderate, your data volume is in the thousands-to-low-tens-of-thousands of examples, and your latency budget doesn't require collapsing the adapter into the base weights. At our scale of consumer image GenAI, LoRA is the workhorse — brand adapters for Photoleaf, style adapters for LightX, vertical adapters for enterprise creative engines. Cost: typically <$2K per adapter on A100s. Quality ceiling: high for style and surface behavior, low for genuinely new capabilities.

When LoRA falls down: when the domain shift is large (e.g. radiology DICOMs vs natural images), when you need new tokenization (Indic scripts, code), or when the behavior you want isn't representable in your data — only in your preferences.
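
To see why LoRA's quality ceiling is bounded, it can help to count what it actually trains. This is a rough sketch, assuming the standard LoRA factorization in which a rank-r adapter adds r × (d_in + d_out) parameters per adapted weight matrix. The model shape (32 layers, hidden size 4096, query and value projections only, rank 16) is an illustrative assumption, not a description of any specific engagement.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Parameters added by LoRA: each adapted matrix gets A (rank x d_in) and B (d_out x rank)."""
    return n_matrices * rank * (d_in + d_out)


# Assumed shape: 32 layers, hidden size 4096,
# adapting the query and value projections in each layer at rank 16.
added = lora_trainable_params(d_in=4096, d_out=4096, rank=16, n_matrices=32 * 2)
base = 7_000_000_000  # nominal 7B base model

print(f"{added:,} trainable parameters, {added / base:.2%} of the base")
```

Under these assumptions the adapter trains a fraction of a percent of the base model's weights, which is consistent with it being cheap and strong on style but limited when the domain shift is large.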

Full supervised fine-tuning

Full SFT is the right move when LoRA's representational capacity is the bottleneck — typically when you have >100K high-quality examples and a domain shift that LoRA can't bridge. Cost scales with parameter count: a 7B SFT run is usually $15–40K in compute; a 70B run is $200K+. Quality ceiling: significantly higher than LoRA on domain-specific accuracy, but it does nothing for behavior.

When full SFT falls down: when the failure mode isn't "wrong answer" but "right answer, wrong tone" or "right answer, unsafe phrasing." SFT can teach a model what to say. It cannot reliably teach it what not to say.

RLHF / DPO / ORPO

This is where most teams under-invest. The shift from "the model knows the domain" to "the model behaves correctly in deployment" almost always requires post-training alignment. We've shipped RLHF pipelines using PPO and reward models for two BFSI clients; DPO for three media and creative clients (faster to iterate, no separate reward model); and ORPO for one healthcare client where the preference data was scarce and we needed to combine SFT and preference optimization in a single pass.
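
For readers who haven't seen why DPO needs no separate reward model, here is a minimal sketch of the per-pair DPO objective in plain Python. It assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy and a frozen reference model; the function name and the default `beta` are illustrative, not taken from our pipelines.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The log-probability ratio against the reference plays the role of an implicit reward, so the only models involved are the policy and its frozen starting point.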

Cost: surprisingly often dominated by preference data collection, not compute. Plan for $20–80K in human annotation for a serious RLHF pipeline. Compute itself is usually <$30K for a 7B model.

When RLHF falls down: when you don't actually have a behavioral failure mode — you have an accuracy failure mode. RLHF won't fix "the model doesn't know our product catalog." SFT will.

The decision framework we use with clients

What's the failure mode? Wrong answer → SFT or pretraining. Wrong behavior → alignment.
What's your data shape? <10K examples → LoRA. 10K–100K → LoRA or SFT depending on domain shift. >100K → SFT. Preferences instead of correct answers → alignment.
What's your latency budget? Sub-50ms → likely need distillation after any of the above.
What's your inference scale? At >10M requests/day, the cost of compression and routing dominates everything else.
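
The four questions above can be sketched as a small rule function. This is a simplification under stated assumptions: the `Workload` fields and the `recommend` name are hypothetical, preference-only data is folded into the "wrong_behavior" case, and a single boolean stands in for the judgment call on domain shift.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    failure_mode: str          # "wrong_answer" or "wrong_behavior"
    n_examples: int            # labeled examples available
    large_domain_shift: bool   # judgment call, e.g. DICOMs vs natural images
    latency_budget_ms: float
    requests_per_day: int


def recommend(w: Workload) -> list[str]:
    """Return the training and serving steps suggested by the thresholds in the text."""
    steps = []
    if w.failure_mode == "wrong_behavior":
        steps.append("alignment (RLHF / DPO / ORPO)")
    elif w.n_examples < 10_000:
        steps.append("LoRA")
    elif w.n_examples <= 100_000:
        steps.append("SFT" if w.large_domain_shift else "LoRA")
    else:
        steps.append("SFT")
    if w.latency_budget_ms < 50:
        steps.append("distillation")
    if w.requests_per_day > 10_000_000:
        steps.append("compression and routing")
    return steps


print(recommend(Workload("wrong_answer", 50_000, True, 30, 20_000_000)))
```

A real engagement weighs these signals together rather than as hard cut-offs; the function only makes the stated thresholds explicit.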

The common mistake: treating these as alternatives. In production, the right answer is usually all three in sequence — LoRA or SFT first, then alignment, then compression. The packages we sell are shaped around exactly that sequence.

— Sharad Shankar, Founder

FIELD NOTES · TRAINING OBSERVABILITY · 9 MIN READ

Mid-training debugging: catching divergence at hour 40

The expensive failures don't happen at the start of a training run. They happen forty hours in, on a Saturday, when no one's watching the dashboard. Loss is still going down. Token efficiency looks normal. Then somewhere between checkpoint 18 and checkpoint 19, gradient norm spikes by 12x, and by the time anyone notices on Monday, you've written eighteen hours of garbage to disk and the next four checkpoints are unrecoverable.

This is the failure mode we built our mid-training observability practice to catch. Here's what we instrument and what we look at.

What we track (beyond loss)

Loss is the dashboard everyone has. It's also the dashboard that will lie to you for the longest. By the time loss starts looking wrong, the divergence has been building for hours. The leading indicators we actually watch:

Gradient norm — sudden 5x+ spikes are the canonical divergence signal. Set alerts at 3x rolling baseline.
Gradient norm variance across the batch — rising variance with stable mean is often a sign of a few pathological examples poisoning the run.
Token efficiency (useful tokens / total tokens) — drops here usually mean your data loader is starting to feed degenerate sequences.
Hardware utilization — sudden drops mean you're stalling on I/O, which often means your dataset shard is corrupt.
Activation norms at chosen layers — for long runs, these drift in characteristic ways before loss does.
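
As an illustration of the gradient-norm alert, here is a minimal sketch of a 3x rolling-baseline check. The class name, the window size, and the choice of a rolling median as the baseline are assumptions for the example; the text only specifies the 3x threshold.

```python
from collections import deque
from statistics import median


class GradNormMonitor:
    def __init__(self, window: int = 200, alert_ratio: float = 3.0):
        self.history = deque(maxlen=window)  # recent non-spike gradient norms
        self.alert_ratio = alert_ratio

    def update(self, grad_norm: float) -> bool:
        """Record one step's gradient norm; return True if it should raise an alert."""
        alert = False
        # Only alert once the window is full, so warm-up noise is ignored.
        if len(self.history) == self.history.maxlen:
            baseline = median(self.history)
            alert = grad_norm > self.alert_ratio * baseline
        if not alert:
            # Keep spikes out of the baseline so one bad step doesn't raise the bar.
            self.history.append(grad_norm)
        return alert
```

In a training loop you would call `update` once per optimizer step with the global gradient norm and page someone (or pause the run) when it returns True.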

Intervention playbook

When gradient norm spikes:

Pause the run rather than killing it, so you can resume from a checkpoint.
Inspect the most recent batch. 90% of the time it's bad data.
If data looks clean, drop LR by 10x and resume from the pre-spike checkpoint.
If that fails twice, the issue is architectural — likely a numerical stability problem you should have caught in the ablation phase.
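
The escalation above can be written down as a tiny decision function, which is handy if you want the on-call runbook and the automation to agree. This is a sketch: the function name and action labels are hypothetical, and what to do with a bad batch (here, quarantine it and resume) is an assumption the text doesn't spell out.

```python
def next_action(batch_is_clean: bool, failed_resumes: int, lr: float) -> tuple[str, float]:
    """Return (action, learning_rate) for a paused run after a gradient-norm spike."""
    if not batch_is_clean:
        # Bad data is the common case; the learning rate stays as it was.
        return ("quarantine_batch_and_resume", lr)
    if failed_resumes >= 2:
        # Two failed low-LR resumes point at the architecture, not the data.
        return ("stop_and_review_architecture", lr)
    # Clean data: resume from the pre-spike checkpoint at one tenth the LR.
    return ("resume_from_pre_spike_checkpoint", lr / 10)
```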

When token efficiency drops:

Check the shard your loader is currently on.
Validate against your data quality checksum (you do have one).
Skip the shard if necessary — never silently downsample.
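
One way to make the shard check concrete, assuming your "data quality checksum" is a manifest of SHA-256 hashes keyed by shard file name (an assumption; your checksum may be something richer). The helper names are hypothetical. The point is that a skipped shard is always logged, never dropped silently.

```python
import hashlib
import logging
from pathlib import Path

log = logging.getLogger("shard_check")


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def usable_shards(shards: list[Path], manifest: dict[str, str]) -> list[Path]:
    """Keep shards whose hash matches the manifest; log every shard that is skipped."""
    good = []
    for shard in shards:
        if manifest.get(shard.name) == sha256_of(shard):
            good.append(shard)
        else:
            log.warning("skipping shard %s: checksum mismatch or missing manifest entry",
                        shard.name)
    return good
```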

When hardware utilization tanks:

Check disk I/O. Then network. Then your gradient synchronization step.
NCCL hangs are the silent killer. Set explicit timeouts and dump rank-level state on timeout.

The thing no one tells you

The most expensive mistake in long training runs isn't the divergence itself. It's not having a clean rollback target. Every team we've worked with that lost meaningful compute did so because their checkpointing strategy was "save every N steps" without a retention policy, and by the time they noticed the run was diverging, the last clean checkpoint had been overwritten.

Save more checkpoints than you think you need. Tier them: keep every 5th checkpoint indefinitely, and every checkpoint from the last 24 hours. Storage is cheap. Re-running pretraining is not.
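
A tiered retention rule like the one described is only a few lines. This sketch assumes checkpoints are identified by an integer index and a save timestamp; the function name and parameters are illustrative.

```python
from datetime import datetime, timedelta


def checkpoints_to_keep(checkpoints: list[tuple[int, datetime]], now: datetime,
                        keep_every: int = 5,
                        recent_window: timedelta = timedelta(hours=24)) -> set[int]:
    """checkpoints: (index, saved_at) pairs. Returns the indices that must not be deleted."""
    keep = set()
    for index, saved_at in checkpoints:
        is_milestone = index % keep_every == 0          # kept indefinitely
        is_recent = now - saved_at <= recent_window     # kept while inside the window
        if is_milestone or is_recent:
            keep.add(index)
    return keep
```

Anything not in the returned set is eligible for deletion. Running the cleanup as a separate job, rather than inside the training loop, keeps a diverging run from pruning its own rollback targets.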

— AndOr training infrastructure team

FIELD NOTES · INFERENCE OPTIMIZATION · 11 MIN READ

Post-training cost engineering: distillation, quantization, and speculative decoding in real deployments

A capable model is not a deployable model. The gap between "passing eval" and "running profitably at scale" is closed by post-training compression — and most teams get this wrong by trying to do it all at once, or by treating it as an afterthought.

Here's what we've learned shipping compressed models to consumer-scale LightX traffic and to two BFSI clients running models in their VPCs.

Distillation

The highest-impact lever, by a wide margin. A well-distilled student model at 1/4 the parameter count of its teacher will often retain 92–97% of the teacher's domain accuracy at 4x the throughput. We use distillation in nearly every production engagement.
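
For reference, the classic soft-label distillation objective looks like this in plain Python. It is a sketch of one common formulation (KL divergence between temperature-softened teacher and student distributions, scaled by T²), not a claim about the exact loss used in any particular engagement; the temperature default is an assumption.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


print(distillation_loss([4.0, 1.0, 0.5], [2.5, 1.5, 0.2]))
```

In practice this term is usually mixed with the ordinary task loss and computed on batches of logits in your training framework.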

Two failure modes we see constantly:

Distilling from a teacher that hasn't been aligned. The student inherits all the misalignment, plus a smaller capacity to recover from it. Align first, distill second.
Distilling on the wrong data distribution. Your distillation corpus should look like production traffic, not like your training data. This is non-obvious and routinely overlooked.

Quantization

INT8 is essentially free for most modern architectures — do it. INT4 is where things get interesting. Quantization-aware training (QAT) almost always beats post-training quantization (PTQ) at INT4, but costs an additional fine-tuning pass.

When PTQ is enough: short-context inference, models where you can tolerate 1–2% accuracy loss, deployments where the cost saving matters more than the marginal quality.

When QAT is worth it: latency-critical deployments, models in regulated domains where accuracy regressions need to be defensible, long-context workloads where PTQ quality degrades nonlinearly.
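
To make the PTQ side concrete, here is the simplest possible version: symmetric, per-tensor, round-to-nearest INT8 quantization of a weight list, with the round-trip error measured afterwards. This is a teaching sketch under those assumptions; production toolchains typically use per-channel or per-group scales and calibration data, and the example weights are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error = {max_error:.6f}, scale / 2 = {scale / 2:.6f}")
```

The round-trip error is bounded by half the scale. At INT4 the same range is split into far fewer levels, so the scale and the error bound grow accordingly, which is one way to see why INT4 is where QAT starts to pay for itself.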

Speculative decoding

Cheapest latency win in the stack if your traffic pattern fits. We've seen 2–3x throughput improvements on autoregressive workloads with no accuracy loss, since the production model still verifies every drafted token. The draft model can often be a version of your production model at half the size, or even a quantized version of it.

When it doesn't help: very short outputs (the speculation overhead dominates), batch sizes >32 (you're already throughput-bound, not latency-bound), and workloads with high output entropy (acceptance rates collapse).
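
The sensitivity to acceptance rate is easy to see with the standard back-of-envelope model for speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability, and that one draft step costs a fixed fraction of one target step; both are simplifications, and the specific numbers are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass, assuming i.i.d. acceptance."""
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


def expected_speedup(acceptance_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Speedup vs plain decoding; draft_cost_ratio = draft step time / target step time."""
    cost_per_step = draft_len * draft_cost_ratio + 1  # draft passes plus one verify pass
    return expected_tokens_per_step(acceptance_rate, draft_len) / cost_per_step


for a in (0.9, 0.7, 0.4):
    speedup = expected_speedup(a, draft_len=4, draft_cost_ratio=0.1)
    print(f"acceptance={a:.1f}  speedup={speedup:.2f}x")
```

When acceptance drops, most drafted tokens are thrown away and the speedup shrinks toward 1x or below, which is the high-entropy failure case described above.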

The order that works

We almost always run these in this sequence:

Align the full-size model
Distill to target size
Quantize (QAT if you can afford the training pass, PTQ if not)
Layer speculative decoding on top if traffic pattern fits

The order matters. Quantizing before alignment is a common and expensive mistake — alignment doesn't transfer cleanly across precisions. Same for distilling before alignment.

The cost calculation teams get wrong

Compression isn't about per-token cost. It's about end-to-end serving cost — including the engineering hours to maintain the compressed pipeline, the eval harness needed to detect quality regressions, and the rollback infrastructure for when compression turns out to have broken something subtle.

Budget for the operational tail, not just the compute saving.

— AndOr inference engineering team

MORE NOTES

MAR 2026

Distilling SAM for consumer phones

How we shrunk a foundation segmentation model into a 7MB student that runs in 80ms on mid-range Androids.

Read note →

FEB 2026

LoRA vs full SFT: the decision rule we actually use

An opinionated frame for when LoRA is enough — and the data-volume thresholds where you should pay for full SFT.

Read note →

FEB 2026

Indic OCR: conjuncts, ligatures, and what breaks vanilla CRNNs

A tour of why generic OCR underperforms on Devanagari and Bengali — and the architecture choices that fix it.

Read note →

JAN 2026

Designing eval harnesses for VLMs

Why hallucination, refusal, and modality leakage need separate evals — and how we structure them in production.

Read note →

DEC 2025

RAG isn't a model. It's a system.

Most RAG failures are retrieval failures, chunking failures, or eval failures — not LLM failures. A field debug guide.

Read note →

DEC 2025

Generative cost engineering at consumer ARPU

The scheduler tricks, model cascades, and quantization choices that let us serve 5M designs/month profitably.

Read note →

NEXT STEP

Have an AI initiative that needs to actually ship?

Book a 30-minute consultation: hello@andor.in
