Operating AI in Production

Duration: 3 days
Course Code: GAI-2703
Available Formats: Classroom

Overview

Course Description

Operate AI in production through release, observability, evaluation, and incident discipline. LLMOps release and deployment span pinned model snapshot IDs, prompt and configuration as artefacts, blue-green and canary patterns for non-deterministic workloads, and the rollback patterns AI pipelines need. Pre-production gates and runtime observability draw on evaluation thresholds, safety checks, red-team gates, regression suites, OpenTelemetry GenAI semantic conventions as the emerging trace schema, prompt and response capture with PII redaction, and token-cost SLOs that catch unbounded consumption. Evaluation, drift, sourcing, and migration work through online evaluation, judge models with human calibration, distribution-shift and refusal-rate drift, vigilance for provider-pushed silent model changes, contract operationalisation that builds on the GAI-1701 sourcing decision, exit rehearsals, shadow traffic, and gradual rollout. Incident response and production-readiness address AI-specific runbooks, escalation queues for low-confidence outputs, EU AI Act post-market monitoring and serious-incident reporting, and the readiness synthesis for a workload go-live. Hands-on labs produce release-pipeline designs, gate-criteria drafts, observability instrumentation, drift-monitor designs, incident runbooks, and a capstone readiness review. The course is designed for operations architects, SRE leads, and AI architects accountable for production AI workloads.

Skills Gained

By the end of this course, participants will be able to:

  • Release and deploy AI workloads using LLMOps patterns suited to non-deterministic systems
  • Design pre-production audit and quality gates that protect production from silent regressions
  • Instrument runtime observability tailored to AI workload characteristics
  • Detect drift and quality degradation through online evaluation and feedback loops
  • Manage production sourcing, contracts, and migration to limit vendor lock-in
  • Respond to AI production incidents with AI-specific runbooks, escalation queues, and post-incident review

Who Can Benefit

This course is designed for:

  • Operations Architects
  • SRE & Reliability Leads
  • AI Architects
  • Technical Leads

Prerequisites

Participants should enter this course with:

  • GAI-1701 or equivalent
  • Practical experience operating production software systems

Organizational Objectives

This course assists organizations to:

  • Reduce production incidents in AI workloads through gated releases and drift detection
  • Lower vendor lock-in risk by rehearsing model swaps under evaluation-driven controls
  • Improve mean-time-to-resolution for AI incidents through AI-specific runbooks and observability
  • Build cost predictability through workload-level operational telemetry and contract review
  • Establish a production-readiness discipline for AI workloads across the operations function

Software

All attendees must have a modern web browser and an Internet connection.

Course Details

Releasing and Deploying AI Workloads with LLMOps

By the end of this module, you will be able to release AI workloads using LLMOps patterns suited to non-deterministic systems, treat prompts and configuration as versioned artefacts, and design rollback paths for workloads where deterministic rollback does not exist.

  • Model versioning with pinned snapshot IDs, never alias-only references
  • Prompt and configuration as versioned artefacts — registry, alias, rollback
  • Blue-green and canary patterns adapted for non-deterministic outputs
  • Feature flags scoped to prompt version, model snapshot, retrieval config, tool schema
  • Statistical-power sizing for LLM A/B tests
  • The release-pipeline shape — lint, offline eval, cost budget, shadow, canary
  • Hands-on Lab: Design a release pipeline for a candidate AI workload that names the artefact-versioning policy, the canary shape, the rollback path, and the feature-flag dimensions the pipeline supports.
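
As an illustration of the statistical-power sizing topic above, the sketch below estimates how many requests each arm of an LLM A/B test needs before a difference in a pass rate can be detected. It assumes the metric is a simple pass/fail rate and uses the standard normal-approximation formula for a two-sided two-proportion test; the function name, the example rates, and the default alpha and power are illustrative assumptions, not values taken from the course.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_baseline: float, p_candidate: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test.

    p_baseline and p_candidate are the pass rates we want to tell apart
    (hypothetical inputs, e.g. task-success rate of each prompt version).
    """
    if p_baseline == p_candidate:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p_baseline + p_candidate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_baseline * (1 - p_baseline)
                             + p_candidate * (1 - p_candidate))
    ) ** 2
    return math.ceil(numerator / (p_baseline - p_candidate) ** 2)


if __name__ == "__main__":
    # Detecting an 80% -> 85% improvement needs far fewer samples
    # than detecting an 80% -> 82% improvement.
    print(samples_per_arm(0.80, 0.85))
    print(samples_per_arm(0.80, 0.82))
```

Because LLM outputs vary from run to run, small effects typically need canary windows sized from a calculation like this rather than a fixed request count.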

Setting Pre-Production Audit and Quality Gates

By the end of this module, you will be able to design pre-production audit and quality gates that protect production from silent regressions, set evaluation thresholds tied to business outcomes, and run red-team gates that catch failure modes before users see them.

  • The five-gate pipeline — lint, offline eval, cost budget, shadow eval, canary
  • Evaluation thresholds tied to business outcomes, not abstract scores
  • Regression suites with golden-set governance and freshness checks
  • Red-team gates for prompt injection, jailbreak, and over-refusal
  • Safety checks — content classifiers, output policy, refusal calibration
  • Judge-vs-human calibration and the 75-90% agreement target
  • Hands-on Lab: Draft the gate-criteria pack for a candidate AI workload that names the threshold per gate, the golden set the regression suite uses, and the red-team scenarios the pipeline must pass.
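
One way to picture the five-gate pipeline is as an ordered list of thresholds evaluated against collected metrics, stopping at the first failure. The sketch below is a minimal illustration under that assumption; the metric names and threshold values are hypothetical placeholders, and a real gate pack would derive them from business outcomes as the module describes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Gate:
    name: str
    metric: str
    threshold: float
    higher_is_better: bool = True


# Hypothetical thresholds, in the gate order named in this module.
PIPELINE = [
    Gate("lint", "lint_errors", 0, higher_is_better=False),
    Gate("offline_eval", "task_success_rate", 0.90),
    Gate("cost_budget", "usd_per_1k_requests", 12.0, higher_is_better=False),
    Gate("shadow_eval", "shadow_win_rate", 0.50),
    Gate("canary", "canary_error_rate", 0.02, higher_is_better=False),
]


def first_failing_gate(gates: Sequence[Gate],
                       metrics: Dict[str, float]) -> Optional[str]:
    """Return the name of the first gate that fails, or None if all pass."""
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None:
            return gate.name  # missing evidence fails closed
        if gate.higher_is_better:
            passed = value >= gate.threshold
        else:
            passed = value <= gate.threshold
        if not passed:
            return gate.name
    return None


if __name__ == "__main__":
    metrics = {
        "lint_errors": 0,
        "task_success_rate": 0.93,
        "usd_per_1k_requests": 14.5,  # over the hypothetical budget
        "shadow_win_rate": 0.55,
        "canary_error_rate": 0.01,
    }
    print(first_failing_gate(PIPELINE, metrics))
```

Failing closed on a missing metric is a design choice in this sketch: a gate with no evidence is treated the same as a gate that failed.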

Observing AI Workloads at Runtime

By the end of this module, you will be able to instrument runtime observability tailored to AI workload characteristics, name the must-instrument attributes the workload publishes, and enforce token-cost SLOs that catch unbounded consumption before it becomes an incident.

  • OpenTelemetry GenAI semantic conventions as the emerging trace schema
  • Must-instrument attributes — model, finish reason, input and output tokens, tool calls
  • Token-cost SLOs and per-tenant cost caps as first-class signals
  • Prompt and response capture with PII redaction and retention policy
  • Sampling strategies for high-volume LLM workloads
  • Correlation patterns linking AI telemetry to broader application traces
  • Hands-on Lab: Instrument observability for a candidate AI workload that publishes the OpenTelemetry GenAI attribute set, defines the token-cost SLO, and configures redaction and retention for prompt and response capture.
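
The must-instrument attributes, PII redaction, and token-cost SLO above can be sketched without any tracing library, as plain dictionaries and functions. The attribute keys follow the OpenTelemetry GenAI semantic conventions as best recalled here; the conventions are still evolving, so check the names against the current specification. The model name, prices, budget, and the email-only redaction rule are hypothetical simplifications.

```python
import re
from typing import Dict, Iterable, List, Union

AttrValue = Union[str, int, List[str]]

# Deliberately narrow: a real redactor would cover more PII types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Mask email addresses before a prompt or response is stored."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def span_attributes(model: str, finish_reason: str,
                    input_tokens: int, output_tokens: int) -> Dict[str, AttrValue]:
    """Build the attribute set a span for one model call would carry."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.response.finish_reasons": [finish_reason],
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def request_cost_usd(attrs: Dict[str, AttrValue],
                     usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Derive cost from token counts; prices are hypothetical inputs."""
    return (attrs["gen_ai.usage.input_tokens"] / 1000 * usd_per_1k_input
            + attrs["gen_ai.usage.output_tokens"] / 1000 * usd_per_1k_output)


def breaches_cost_slo(costs: Iterable[float], budget_usd: float) -> bool:
    """True when the summed cost in a window exceeds the budget."""
    return sum(costs) > budget_usd


if __name__ == "__main__":
    attrs = span_attributes("example-model-2025-01-01", "stop", 2000, 500)
    cost = request_cost_usd(attrs, usd_per_1k_input=0.5, usd_per_1k_output=1.5)
    print(redact("Contact jane.doe@example.com for access"))
    print(cost, breaches_cost_slo([cost] * 10, budget_usd=15.0))
```

In practice the same attributes would be set on a span by the tracing SDK, and the cost check would run per tenant over a rolling window.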

Evaluating AI Workloads and Detecting Drift

By the end of this module, you will be able to detect drift and quality degradation through online evaluation and feedback loops, design judge-LLM systems that survive position bias, and route feedback into a flywheel that improves the workload over time.

  • Online evaluation patterns — judge LLMs, pairwise position-bias mitigation, rubric design
  • Drift dimensions — input-distribution, semantic similarity, output quality, refusal rate, cost
  • Ground-truth feedback loops — thumbs and rubrics, low-confidence sampling
  • Judge calibration against a human-labelled golden set
  • Performance-regression bisection — prompt, retrieval, model, tools
  • The single eval suite that runs at dev, pre-release, and online
  • Hands-on Lab: Design a drift-monitor and feedback loop for a candidate AI workload that names the judge model, the drift dimensions watched, the feedback-capture path, and the bisection process when a regression triggers.
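
A common mitigation for pairwise position bias is to ask the judge twice with the answer order swapped and accept a verdict only when both orderings agree. The sketch below illustrates that idea; the `Judge` callable signature and the two stub judges are hypothetical stand-ins for a real judge-model call.

```python
from typing import Callable, Optional

# Hypothetical judge interface: (question, first, second) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str,
                        answer_a: str, answer_b: str) -> Optional[str]:
    """Return "a" or "b" if the judge agrees with itself across both orders.

    Returns None when the verdict flips with position, which this sketch
    treats as a tie to be sampled for human review.
    """
    forward = judge(question, answer_a, answer_b)   # a shown first
    backward = judge(question, answer_b, answer_a)  # b shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return None


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge that looks only at content (here, answer length)."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    q = "Summarise the release policy."
    print(debiased_preference(always_first, q, "short", "a much longer answer"))
    print(debiased_preference(prefers_longer, q, "short", "a much longer answer"))
```

The rate at which this function returns None is itself a useful signal: a high flip rate suggests the rubric or the judge model needs recalibration against the human-labelled golden set.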

Managing Production Sourcing, Contracts, and Vendor Lock-In

By the end of this module, you will be able to manage production sourcing, contracts, and migration to limit vendor lock-in, operationalise the contract clauses set in GAI-1701, and rehearse exit so it stays a real option rather than an aspiration.

  • Contract operationalisation - the seam between initial sourcing and production
  • Model-provider service levels, deprecation notice, and fallback agreements
  • Exit-readiness rehearsals as scheduled work, not a one-time clause
  • Model-abstraction libraries and multi-provider routers as exit instruments
  • Portability tests - prompts, traces, eval suites, fine-tuning artefacts
  • Recognising when fine-tuning increases lock-in beyond contract protection
  • Hands-on Lab: Run an exit-readiness rehearsal for a candidate AI workload that scores each provider on contract maturity, portability, and the operational steps required to migrate within a documented timeframe.
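
To make the idea of a multi-provider router as an exit instrument concrete, the sketch below puts every provider behind the same callable shape and tries them in order. The provider functions are hypothetical stubs, not real SDK calls; a production router would also need to handle timeouts, prompt differences between providers, and per-provider evaluation baselines.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical uniform interface: each provider takes a prompt, returns text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when no configured provider returned a response."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Tuple[str, Provider]]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, response)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_stub(prompt: str) -> str:
    """Stub simulating a provider outage."""
    raise TimeoutError("simulated outage")


def secondary_stub(prompt: str) -> str:
    """Stub simulating a healthy fallback provider."""
    return f"echo: {prompt}"


if __name__ == "__main__":
    chain = [("primary", primary_stub), ("secondary", secondary_stub)]
    print(complete_with_fallback("hello", chain))
```

Rehearsing an exit then amounts to reordering or removing entries in the chain and re-running the evaluation suite, which is only cheap if the abstraction already exists.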

Migrating, Modernizing, and Swapping Models in Production

By the end of this module, you will be able to migrate and modernise AI workloads through evaluation-driven cutover, defend production against provider-pushed silent model changes, and run shadow traffic that proves a candidate before users see it.

  • Evaluation-driven cutover — parallel eval suites that re-baseline a candidate
  • Shadow traffic — duplicate-request patterns, candidate-output isolation
  • Gradual rollout — 1-to-10 percent canary with behavioural SLO gates
  • Provider-pushed silent model changes — pin dated snapshots, watch alias drift
  • Version-pinning discipline and behavioural-canary on upstream version bumps
  • Migration costs — prompt-rewrite, tool-call schema diffs, fine-tune non-portability
  • Hands-on Lab: Plan a model migration for a candidate AI workload that names the shadow protocol, the cutover gates, the alias-monitoring rule, and the rollback path when the candidate fails an SLO.
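
Two of the mechanisms above lend themselves to a short sketch: a gradual rollout that assigns a stable percentage of traffic to the candidate, and an alias-monitoring rule that flags when the model a provider reports differs from the pinned snapshot. The hashing scheme, key format, and model identifiers below are illustrative assumptions.

```python
import hashlib


def in_canary(request_key: str, percent: float) -> bool:
    """Deterministically place a key (e.g. a tenant ID) in the canary cohort.

    Hashing keeps assignment sticky: a key inside the 1% cohort stays
    inside it as the rollout widens to 10%.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def alias_drifted(pinned_snapshot: str, reported_model: str) -> bool:
    """Flag a response whose reported model differs from the pinned snapshot."""
    return reported_model != pinned_snapshot


if __name__ == "__main__":
    keys = [f"tenant-{i}" for i in range(10_000)]
    share = sum(in_canary(k, 10) for k in keys) / len(keys)
    print(f"canary share at 10%: {share:.3f}")
    # Hypothetical identifiers: a dated snapshot versus a newer one behind an alias.
    print(alias_drifted("example-model-2025-01-01", "example-model-2025-03-15"))
```

Widening the rollout is then a matter of raising `percent` only while the behavioural SLO gates hold, and dropping it to zero is the rollback path.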

Responding to AI Production Incidents

By the end of this module, you will be able to respond to AI production incidents with AI-specific runbooks and escalation queues, write the post-incident review that captures non-deterministic failure modes, and meet the EU AI Act post-market obligations that apply to your workload.

  • AI incident classes - hallucination, prompt injection, over-refusal, tool misuse, cost-runaway, provider outage, silent model change
  • AI-aware diagnosis tree - model vs prompt vs retrieval vs tools vs data
  • Escalation queues for low-confidence outputs and high-risk surfaces
  • Customer communication patterns for non-deterministic failure
  • EU AI Act Article 72 post-market monitoring and Article 73 serious-incident reporting (Article 55 for general-purpose AI models with systemic risk)
  • Post-incident review structure that links to the AI Incident Database where applicable
  • Hands-on Lab: Author AI-specific incident runbooks for a candidate workload covering three named incident classes plus the EU AI Act reporting path, and rehearse one runbook end-to-end with a tabletop scenario.
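
An escalation queue for low-confidence outputs and high-risk surfaces can be reduced to a small routing rule. The sketch below is one possible shape; the confidence score, the two thresholds, and the queue names are hypothetical and would come from the workload's own calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOutput:
    text: str
    confidence: float        # hypothetical score in [0, 1], e.g. from a judge model
    high_risk_surface: bool  # e.g. output shown in a regulated customer flow


def route(output: ModelOutput,
          min_confidence: float = 0.70,
          min_confidence_high_risk: float = 0.90) -> str:
    """Return the queue an output goes to: "deliver" or "human_review".

    High-risk surfaces use a stricter threshold, so the same score can
    be delivered on one surface and escalated on another.
    """
    if output.high_risk_surface:
        threshold = min_confidence_high_risk
    else:
        threshold = min_confidence
    return "deliver" if output.confidence >= threshold else "human_review"


if __name__ == "__main__":
    print(route(ModelOutput("Refund approved.", 0.80, high_risk_surface=False)))
    print(route(ModelOutput("Refund approved.", 0.80, high_risk_surface=True)))
```

During an incident, tightening the thresholds is a quick containment lever that a runbook can name explicitly.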

Production-Readiness and Operations Capstone

By the end of this module, you will be able to synthesise release, observability, evaluation, sourcing, migration, and incident response into a workload-readiness review, and produce the decommissioning plan that closes the workload’s lifecycle properly.

  • Production-readiness checklist - gates, observability, evaluation, runbooks, contracts
  • Workload go-live as a synthesised artefact, not a date
  • Model card and system card maintenance through the lifecycle
  • Structured decommissioning and sunset planning
  • Operational metrics that prove the workload remains fit-for-purpose
  • The seam between operations and the next architectural review
  • Hands-on Lab: Produce a capstone readiness review on a candidate AI workload that integrates the prior labs into one go-live artefact, names the decommissioning trigger, and lists the metrics that gate continued operation.

FAQ

Does the course schedule include a lunch break?

Classes typically include a 1-hour lunch break around midday. However, the exact break times and duration can vary depending on the specific class. Your instructor will provide detailed information at the start of the course.

What languages are used to deliver training?

Most courses are conducted in English, unless otherwise specified. Some courses will have the word "FRENCH" marked in red beside the scheduled date(s) indicating the language of instruction.

What does GTR stand for?

GTR stands for Guaranteed to Run; if you see a course with this status, it means this event is confirmed to run. View our GTR page to see our full list of Guaranteed to Run courses.

Does Ascendient Learning deliver group training?

Yes, we provide training for groups and individuals, as well as private on-site sessions. View our group training page for more information.

What does vendor-authorized training mean?

As a vendor-authorized training partner, we offer a curriculum that our partners have vetted. We use the same course materials and facilitate the same labs as our vendor-delivered training. These courses are considered the gold standard and, as such, are priced accordingly.

Is the training too basic, or will you go deep into technology?

It depends on your requirements, your role in your company, and your depth of knowledge. The good news is that many of our learning paths let you progress from the fundamentals to highly specialized training.

How up-to-date are your courses and support materials?

We continuously work with our vendors to evaluate and refresh course material to reflect the latest training courses and best practices.

Are your instructors seasoned trainers who have deep knowledge of the training topic?

Ascendient Learning instructors have an average of 27 years of practical IT experience and have also served as consultants for an average of 15 years. To stay current, instructors spend at least 25 percent of their time learning new, emerging technologies and courses.

Do you provide hands-on training and exercises in an actual lab environment?

Lab access is dependent on the vendor and the type of training you sign up for. However, many of our top vendors will provide lab access to students to test and practice. The course description will specify lab access.

Will you customize the training for our company’s specific needs and goals?

We will work with you to identify training needs and areas of growth. We offer a variety of delivery methods, such as private group training, on-site training at a location of your choice, and virtual training. We provide courses and certifications that are aligned with your business goals.

How do I get started with certification?

Getting started on a certification pathway depends on your goals and the vendor you choose to get certified in. Many vendors offer entry-level IT certification to advanced IT certification that can boost your career. To get access to certification vouchers and discounts, please contact info@ascendientlearning.com.

Will I get access to content after I complete a course?

You will get access to the PDF of course books and guides, but access to the recording and slides will depend on the vendor and type of training you receive.

How do I request a W9 for Ascendient Learning?

View our filing status and how to request a W9.

Reviews

The technical data in the AWS Solutions Architect course was very thorough.

Great instructor, clear and concise course. Labs were easy to follow and worked perfectly.

Fantastic and great training. Tons of hands-on labs to really make you understand the material being taught.

It was very informative and covered all the required materials along with hands-on labs for practice.

Very good online learning. The instructor is very good in the way he explained everything.