MLOps vs DevOps for LLM Production: Where the Difference Starts.
Leadership asks for “our usual CI/CD,” then learns the new model cut toxicity but broke legal Q&A fidelity [1]. DevOps proves the service is up; MLOps proves behaviour stays inside the agreed quality bounds — non-functional and statistical tests, not just availability checks [2][3].
Pair this with the weekly RAG scorecard, the five RAG metrics, and the rest of the Nuqta Journal.
Definition: what sits on top of deploy?
Model registry, dataset lineage, pre- and post-deployment quality metrics, a rollback policy — vanilla DevOps tooling does not produce these on its own [2].
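A minimal sketch of what that extra layer has to record alongside the container tag CI/CD already tracks; the field names and values below are illustrative assumptions, not a specific registry's schema.

```python
# Illustrative sketch: a registry entry that carries lineage, eval
# metrics, and the rollback floor — the parts plain CI/CD does not hold.
from dataclasses import dataclass

@dataclass
class ModelRelease:
    model_id: str                         # e.g. "support-bot-llm" (hypothetical)
    version: str                          # registry tag, not just a git SHA
    dataset_lineage: list[str]            # eval datasets / snapshots used pre-deploy
    eval_metrics: dict[str, float]        # measured quality before rollout
    rollback_floor: dict[str, float]      # agreed quality breach that triggers rollback

release = ModelRelease(
    model_id="support-bot-llm",
    version="2026-05-12.1",
    dataset_lineage=["legal-qa-eval-v3", "toxicity-holdout-v7"],
    eval_metrics={"legal_qa_fidelity": 0.91, "toxicity_rate": 0.004},
    rollback_floor={"legal_qa_fidelity": 0.88, "toxicity_rate": 0.01},
)
```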
Operational evidence.
A canary for an LLM compares quality distributions between the current and candidate model — not only the HTTP 500 rate [3].
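A minimal sketch of such a gate, assuming per-answer quality scores from a labelled eval or an offline judge; the two-sample test and thresholds are illustrative choices, not a prescribed method.

```python
# Illustrative canary gate: block the rollout only when the canary's
# quality distribution is both statistically distinguishable from the
# baseline and practically worse than the agreed tolerance.
from scipy.stats import ks_2samp

def canary_quality_gate(baseline_scores: list[float],
                        canary_scores: list[float],
                        alpha: float = 0.05,
                        max_mean_drop: float = 0.02) -> bool:
    """Return True if the canary may proceed, False if it should be held."""
    stat, p_value = ks_2samp(baseline_scores, canary_scores)
    mean_drop = (sum(baseline_scores) / len(baseline_scores)
                 - sum(canary_scores) / len(canary_scores))
    return not (p_value < alpha and mean_drop > max_mean_drop)
```

An HTTP-only canary would pass a model that answers quickly and confidently wrong; this gate fails it on the score distribution instead.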
“A zero-error redeploy is not an ethical launch if the model regresses for a user segment.”
Numbers from the field.
In our post-launch reviews, rollback time on a quality drift often matters more than first-build time — the stop-loss is user trust [4].
Behavioural SLOs.
- p95 latency.
- Share of answers with cited support.
- Toxicity ceiling — plus a weekly scoreboard [5]; a minimal check is sketched after this list.
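A minimal sketch of the weekly scoreboard check implied by the list above; the thresholds and metric names are illustrative assumptions, not a published standard.

```python
# Illustrative weekly behavioural-SLO check over one week of traffic.
import statistics

SLO = {
    "p95_latency_ms": 1200,       # ceiling
    "cited_support_rate": 0.92,   # floor: share of answers with cited support
    "toxicity_rate": 0.005,       # ceiling
}

def weekly_scoreboard(latencies_ms: list[float],
                      cited_flags: list[bool],
                      toxic_flags: list[bool]) -> dict[str, bool]:
    p95 = statistics.quantiles(latencies_ms, n=100)[94]   # 95th percentile
    cited = sum(cited_flags) / len(cited_flags)
    toxic = sum(toxic_flags) / len(toxic_flags)
    return {
        "p95_latency_ms": p95 <= SLO["p95_latency_ms"],
        "cited_support_rate": cited >= SLO["cited_support_rate"],
        "toxicity_rate": toxic <= SLO["toxicity_rate"],
    }
```

Any False on the scoreboard is a quality incident, handled with the same discipline as an outage.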
Honest caveats.
Automation without a labelled evaluation set just ships chaos faster [2].
Closing.
Book a one-hour meeting and ask: “What is our behavioural SLO?” No answer means you are operating a server, not a model — see the RAG scorecard.
Frequently asked questions.
- Is Git enough for models? No — you need registries, tags, and stored artefacts [2].
- Can the same DevOps team run it? Maybe — but the tests differ [3].
- How does RAG relate? Retrieval ops are part of behavioural assurance — measure them.
- Do LLMs need a different SRE practice? Yes — quality incidents, not only uptime [4].
- When do you roll back? On an agreed quality breach — not only on the first code bug [3].
Sources.
[1] Sato et al. — Continuous Delivery for Machine Learning (CD4ML), Thoughtworks.
[2] Google Cloud — MLOps: continuous delivery and automation pipelines in machine learning.
[3] Breck et al. — The ML Test Score: a rubric for ML production readiness.
[4] Nuqta — internal launch notes, May 2026.
[5] Nuqta — [RAG scorecard](/en/journal/rag-ops-weekly-scorecard-2026) — May 2026.
Related posts
- The weekly RAG scorecard before blaming the frontier model.
Four KPIs — recall@k, citation accuracy, p95 latency, drift — stamped every Monday keep retrieval honest.
- Five RAG metrics to check before you blame the LLM.
Before you raise model spend or switch vendors, measure retrieval, chunks, and escalation. Most production hallucination starts in documents and indexes — not parameter count.
- What is RAG — and why your company bot answers like a stranger.
A practical guide to Retrieval-Augmented Generation: how your bot reads documents before answering, and why it costs 10× less than fine-tuning.
- Inference vs training for LLMs — who pays for what.
Training may run once (even if it takes many hours) and you pay a cluster bill. Inference runs forever and turns a model into a per-token Opex line. This article separates the two checkbooks so pilot budgets are not mixed with product bills.
- Grafana for LLM stacks — what you must chart before you blame the GPU.
HTTP 200 is not cognition. Separate edge latency from inference backlog, KV pressure, retrieval lag, then token-dollar math on one executive wall.