What does an Apertus on-prem deployment actually include?

The engagement covers install, GPU sizing against your real prompt mix, inference server setup with vLLM or TGI, monitoring wired into your observability stack, load testing and a production SLA. Discovery and an Apertus evaluation POC run upstream of the rollout.

Which GPU configuration does Apertus need in production?

Sizing is decided in discovery against your real prompt mix, concurrency and latency targets. Apertus on-prem clusters typically use H100, A100 or L40S class silicon; the 8B tolerates smaller and quantized footprints, the 70B lands on heavier hardware.

vLLM vs Text-Generation-Inference vs llama.cpp — which one, when?

vLLM is the default for high-throughput server-side inference. TGI fits where the Hugging Face operator surface is already in use. llama.cpp stays reserved for edge or developer laptops, not production. The choice rides on your concurrency and ops profile.

How is the Apertus cluster monitored on-prem?

Apertus is wired into your existing Prometheus and Grafana, with alerting routed to on-call. Metrics cover GPU utilisation, queue depth, prompt and completion lengths, request latency and error classes. Your platform team sees inference as any other service.

What does a production SLA for self-hosted Apertus look like?

The SLA is shaped to your workload class — interactive chat, batch enrichment or background RAG — and covers response targets, on-call windows, incident classification and escalation paths. Targets are agreed in discovery against the app's latency budget.

How long from order to live Apertus inference, typically?

A typical Apertus production deployment runs six to ten weeks from signed engagement to live inference, assuming GPUs are available. An evaluation POC sits upstream; fine-tuning and RAG sit downstream.

Home

Apertus Swiss LLM Deployment and Integration Partner

Apertus On-Prem Deployment

Apertus Swiss LLM Deployment and Integration Partner

Apertus On-Prem Deployment

Run Apertus, the Swiss open-weights LLM, inside your own datacenter or a sovereign Swiss cloud. We size GPUs to your real prompt mix, install vLLM or TGI, wire monitoring into your stack, and stay on call through go-live.

What Apertus on-prem deployment includes

Discovery and inference install

We start with a working discovery — workload class, concurrency profile, latency budget, regulatory frame — and install the inference server in your estate. vLLM is the default for high-throughput Apertus on-prem deployment; TGI fits teams already on the Hugging Face surface. The install lands in your CI and image registry, not on a laptop.

GPU sizing to your prompt mix

Production sizing is never guessed from a model card. We benchmark Apertus 8B, and the 70B variant where the workload demands it, against your real prompt distribution and concurrency. Clusters land on H100, A100 or L40S class silicon; quantized variants extend the 8B class onto lighter hardware. Hardware is customer-procured or cloud only.

Monitoring wired into your stack

Apertus is wired into your existing Prometheus and Grafana, with alerting routed to your on-call system. We expose GPU utilisation, queue depth, prompt and completion lengths, request latency and error classes — the metrics your platform team already reads for every other service. No parallel observability silo.

Load testing to acceptance set

Before go-live the cluster is load-tested against a representative request mix and an acceptance set drawn from your real prompts. Sustained throughput, p95 and p99 latency, queue saturation and failure modes are measured under expected and stressed concurrency, then signed off jointly. Production starts on a benchmark, not on a hope.

Production SLA and on-call

The SLA is shaped to the workload class — interactive chat, batch enrichment or background RAG — and covers response targets, on-call coverage windows, incident classification and escalation paths. Targets sit on the latency budget the application can actually absorb, agreed in discovery rather than copied from a generic template.

Hand-off to your platform team

Knowledge transfer is part of the engagement, not a follow-on quote. Runbooks, scaling playbooks, capacity model, image-rebuild pipeline and incident procedures are written for your platform team and rehearsed with them. For the hosting alternative see Swiss data-sovereignty hosting.

How we deliver on-prem

Discovery and frame
Workload class, expected concurrency, latency budget and regulatory frame — FINMA, cantonal residency, healthcare regulation — are mapped in two to four weeks. The output is the sizing brief that drives every later step.
GPU sizing and supply
We size against your real prompt mix and concurrency, not a generic recipe. The plan names GPU class (H100, A100 or L40S range), node topology, network and storage budget, and whether the 8B or 70B variant fits.
Install and serving
vLLM is the default; TGI is picked where the Hugging Face operator surface is already in use; llama.cpp is reserved for edge. The install runs through your image pipeline, your secrets store and your CI, like any other service.
Monitoring and alerting
Apertus emits the metrics your stack already speaks. We wire Prometheus scraping, Grafana dashboards and alert rules into your existing observability, with on-call routing through whatever paging system you already use.
Load test and acceptance
We run a representative load test against your acceptance set under expected and stressed concurrency, capture p95 and p99 latency, sustained throughput and failure modes, and sign off jointly with your team.
Go-live and on-call
Cutover happens in a defined window with documented rollback. We stay engaged through the first production peak, hand over the runbooks and capacity model, and remain on call under the agreed SLA.

Workload class, expected concurrency, latency budget and regulatory frame — FINMA, cantonal residency, healthcare regulation — are mapped in two to four weeks. The output is the sizing brief that drives every later step.

We size against your real prompt mix and concurrency, not a generic recipe. The plan names GPU class (H100, A100 or L40S range), node topology, network and storage budget, and whether the 8B or 70B variant fits.

vLLM is the default; TGI is picked where the Hugging Face operator surface is already in use; llama.cpp is reserved for edge. The install runs through your image pipeline, your secrets store and your CI, like any other service.

Apertus emits the metrics your stack already speaks. We wire Prometheus scraping, Grafana dashboards and alert rules into your existing observability, with on-call routing through whatever paging system you already use.

We run a representative load test against your acceptance set under expected and stressed concurrency, capture p95 and p99 latency, sustained throughput and failure modes, and sign off jointly with your team.

Cutover happens in a defined window with documented rollback. We stay engaged through the first production peak, hand over the runbooks and capacity model, and remain on call under the agreed SLA.

Why run Apertus on-prem with us

On-prem is the work — the model is the easy part

We install the inference server, size the GPUs against your real prompt mix, wire monitoring into your existing Prometheus, and stay on call through the first production peak. On-prem is the work — the model is the easy part.

Swiss delivery for regulated workloads

The engineering team is in Switzerland and the EU; the deployment lands inside your datacenter or a sovereign Swiss cloud. The frame fits FINMA, cantonal data residency and healthcare regulation by design — we implement under those regimes, we do not sell the certification. For the hosting alternative see Swiss data-sovereignty hosting.

Sized on your prompt mix, not on a model card

GPU sizing is decided in discovery against the prompts your users actually send and the concurrency you actually see. Apertus on-prem deployment clusters live on H100, A100 or L40S class hardware; the 8B and 70B variants get chosen on workload, not on enthusiasm. Hardware stays customer-procured — no GPU reselling on our side.

One platform from POC through RAG to fine-tuning

The same engagement team that runs the production rollout also runs the upstream Apertus evaluation POC and the downstream fine-tuning and RAG integration phases. The platform spine — Laravel, Next.js, PostgreSQL with pgvector, Redis, Mastra orchestration — is the same one used across Apertus Swiss LLM and the broader AI consulting practice.

Frequently Asked Questions

The engagement covers install, GPU sizing against your real prompt mix, inference server setup with vLLM or TGI, monitoring wired into your observability stack, load testing and a production SLA. Discovery and an Apertus evaluation POC run upstream of the rollout.

About SAPIENTROQ

Hey there! I’m your AI assistant developed by SAPIENTROQ. I am a language model connected to a RAG database that contains information about the company. If you need insights on AI solutions, real use cases, or how AI can boost your business, please feel free to ask in any language you prefer.

Choose an option

Hey! I am AI agent developed by SAPIENTROQ 🤖

Interested in a solution?

We are glad to show you various options without any obligation.

Roland Kurmann

CEO, SAPIENTROQ

Book a call

Apertus On-Prem Deployment

What Apertus on-prem deployment includes

How we deliver on-prem

Discovery and frame

GPU sizing and supply

Install and serving

Monitoring and alerting

Load test and acceptance

Go-live and on-call