LiteLLM × Bedrock · From production

Turn Bedrock into a production model gateway,
hardened layer by layer

Customers want a single OpenAI / Anthropic-compatible entry point backed by Amazon Bedrock, with central control over keys, cost, and rate limits, and without touching each vendor's SDK. LiteLLM sits exactly there. The hard part isn't installing LiteLLM. It's that as the customer's requirements for network isolation and account boundaries grow, the model config has to grow with them, one layer at a time. The four layers below: configure exactly as far as the customer needs.

L1
Public endpoint
Pod has internet egress, hits Bedrock's public endpoint
model + region
▲ add private networking
L2
In-region VPC Endpoint
No internet egress; traffic goes through an in-region VPCE
+ VPCE
▲ add cross-region
L3
Cross-region US Inference Profile
Reach us-west-2 privately over VPC Peering
+ VPC Peering
▲ add cross-account
L4
Cross-account, unified
Many AWS accounts, one gateway issuing keys and billing
+ AssumeRole
Field notes from a LiteLLM proxy that has run in production and served multiple customers. Account IDs, VPC endpoints, hostnames, and secrets have all been redacted to placeholders (<ACCOUNT_B>, vpce-xxxxx). You can build from this; it contains nothing that points to a real resource.
Quick Start

Validate the minimal path: one Pod, public egress

Private networking, cross-region, and cross-account can all be introduced later. The fastest way to validate is to let one Pod with internet egress call Bedrock's public endpoint directly, get the full "client → gateway → Bedrock" path working, then harden it layer by layer. Three steps follow.

Quick Start · the minimal path
Clientuses virtual key ALBtimeout → 600s Amazon EKSLiteLLM Pod internetPod has egress Bedrockglobal.*
  1. Configure one model. Put a model ID and region in model_list; let EKS Pod Identity inject credentials, with no access key written anywhere.
  2. Point the client here. Have the app or Claude Code set its base URL to the gateway and authenticate with a LiteLLM virtual key.
  3. Raise the timeouts up front. Set the load balancer and LiteLLM timeouts to 600 seconds from the start, or long conversations get cut off mid-stream (the single most common failure, covered in its own section below).
Smallest runnable model_list
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: bedrock/global.anthropic.claude-sonnet-4-6
      aws_region_name: ap-northeast-1
      drop_params: true
Do this on day one

Raise every timeout in the path to 600 seconds and align them. The ALB's idle timeout defaults to just 60 seconds, while long conversations (extended thinking, multi-turn agents) routinely run tens of seconds to minutes; once 60 seconds pass with no new data, the connection is dropped while LiteLLM is still waiting on the model perfectly normally, which is confusing to debug. See the Timeouts & LB section.

This minimal path is simply layer one of the four below. Once it's working, look at what LiteLLM does in this architecture and how the whole thing is structured, then return to the four layers to harden it.

00
What LiteLLM is here

LiteLLM's role in this architecture

Many people first hear of LiteLLM as an SDK you drop into code to call models. It does have an SDK form, but this architecture uses the other side of it: the Proxy Server, a standalone service process that exposes two standard HTTP interfaces, /v1/chat/completions (OpenAI format) and /v1/messages (Anthropic format), and internally translates requests into each vendor's API. Sitting between the customer and Bedrock, it solves four very concrete problems.

01
One entry point

The customer's apps, Claude Code, and scripts all point at the same endpoint with the same key. Switching models, adding models, tuning routing all happen in the gateway config; the client never changes a line.

02
Credentials contained

Bedrock IAM credentials and cross-account AssumeRole stay locked inside the gateway Pod. Clients only ever see a LiteLLM virtual key, never AWS credentials; revoking a customer is just deleting a key.

03
Cost and usage visible

LiteLLM's built-in spend log records, per request, which model ran, how many tokens, and the dollar cost, all persisted to a database. Who uses the most and which model costs the most is right there.

04
Model differences smoothed

Claude on Bedrock differs from the first-party API in call format, thinking parameters, and beta headers. The gateway absorbs those differences so clients just send the standard format.

Architecture

The full picture

The diagram below shows all four config layers stacked together. The path comes in from the client, through the ALB to the LiteLLM Pods on EKS, then out to Bedrock along three different routes depending on the model, with account state persisted in Aurora.

Architecture · the full path with all four layers stacked
Workload VPC · in-region us-west-2 VPC Account B · <ACCOUNT_B> HTTPS · virtual key spend log in-region private cross-region VPC Peering cross-account AssumeRole Clientapp / Claude Code / scripts ALBinbound IP allow-list Amazon EKS creds via Pod Identity LiteLLM Podreplica 1 LiteLLM Podreplica 2 Aurora PostgreSQLconfig / keys / spend log Bedrock VPCEin-region private Bedrockglobal.* · in-region Bedrock VPCE Bedrockus.* Bedrockseparate billing
Client Networking · ALB / VPC / VPCE / Peering Containers · EKS Database · Aurora Bedrock Cross-account · AssumeRole

A few design choices worth calling out. EKS over a single EC2 is about availability: if the gateway goes down, every customer's access drops at once, which is unacceptable. Two replicas roll over during updates, so upgrading the LiteLLM version causes zero downtime; a single EC2 is kept as a cold standby to take over on a full cluster failure. The ALB locks inbound IPs because what sits behind it is billable Bedrock traffic. An open 0.0.0.0/0 security group would expose a high-cost entry point to the internet, and a leaked virtual key would let anyone call it, so only the customer's known IPs are allowed. That's a hard rule.

A common follow-up: why an ALB out front, rather than a CloudFront layer like an ordinary web service? Two reasons. First, timeouts: CloudFront's origin response timeout defaults to 30 seconds and can be raised only to a 120-second ceiling (and that needs a quota request), whereas multi-turn agent and extended-thinking conversations easily run for minutes; CloudFront can't hold a connection that long. An ALB's idle timeout goes up to 4000 seconds, leaving headroom. Second, it's the wrong tool: CloudFront is built for edge-distributing cacheable content, but gateway traffic is all authenticated, all-distinct POST requests with nothing to cache, so adding it only buys an extra hop of latency and cost. A layer-7 load balancer (ALB) is the right fit in front of the gateway.

Pod Identity over IRSA: both inject AWS credentials into Pods, but Pod Identity is simpler to set up and natively supports attaching transitive session tags, a property that matters in the layer-4 cross-account case. Aurora PostgreSQL Serverless v2 backs store_model_in_db and the spend log, with both Pods sharing one record set; it scales with load and costs little when gateway traffic is light. The Pods are deliberately small (250m CPU / 1Gi memory, limits 500m / 2Gi): the LiteLLM proxy is IO-bound, so the bottleneck is network and concurrent connections, not CPU. The securityContext is tightened to least privilege, dropping all Linux capabilities and disabling privilege escalation.

L1
Layer 1 · Public Endpointsimplest

Pod has internet egress, hit Bedrock's public endpoint

The simplest case: the Pod has internet egress. A model then needs only two things, a model ID and a region. Credentials are injected automatically by EKS Pod Identity (using the workload account's own permissions), so LiteLLM never configures an access key.

model_list · layer 1
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: bedrock/global.anthropic.claude-sonnet-4-6
      aws_region_name: ap-northeast-1
      drop_params: true
L1 · Pod reaches Bedrock over the internet
Clientapp / Claude Code ALBinbound IP allow-list Amazon EKSLiteLLM Pod internetPod has egress Bedrockglobal.*

drop_params: true discards OpenAI parameters Bedrock doesn't support, avoiding a 400. Worth explaining the model ID forms here, since they are Bedrock's rules and getting them wrong fails outright: the global.* prefix in global.anthropic.claude-sonnet-4-6 is a cross-region inference profile, where Bedrock routes the request to whichever region has capacity. It has the best availability and is AWS's default recommendation. It must be called through an inference profile, never a bare base model ID.

At this layer the customer can already use it. But traffic rides Bedrock's public endpoint, and many customers' security and compliance won't clear that.

L2
Layer 2 · Same-Region VPCE+ private

No internet egress, route through an in-region VPC Endpoint

To tighten security, the goal is a Pod with no internet access at all. The way to do it is to create a Bedrock VPC Endpoint (VPCE) in the Pod's region so traffic stays on AWS's private network the whole way. The config adds just one line, aws_bedrock_runtime_endpoint, pointing at that VPCE.

model_list · layer 2
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: bedrock/global.anthropic.claude-sonnet-4-6
      aws_region_name: ap-northeast-1
      aws_bedrock_runtime_endpoint: https://vpce-xxxxx.bedrock-runtime.ap-northeast-1.vpce.amazonaws.com
      drop_params: true
L2 · in-region VPCE, fully private
Workload VPC · in-region · Pod has no internet route private AWS backbone ALB LiteLLM PodEKS · no internet Bedrock VPCEPrivate DNS · 443 Bedrockglobal.*

Three things to prepare on the AWS side: create the com.amazonaws.<region>.bedrock-runtime interface endpoint in the workload VPC with Private DNS enabled; allow inbound 443 from the Pod subnet on the VPCE security group; and remove the internet route from the Pod subnet (or simply place it in a private subnet) so Bedrock is reachable only through the VPCE.

At this layer the VPCE and the Pod live in the same VPC, so Private DNS works and the default service hostname would actually resolve to the VPCE too. Writing the VPCE hostname explicitly in the config mainly makes the traffic path obvious, and sets up the habit for the cross-VPC case in layer 3.

L3
Layer 3 · Cross-Region US Profile+ cross-region private

Reach a US Inference Profile cross-region, fully private

AWS recommends global.* by default, but some customers want the inference entry pinned to the US, using a US Inference Profile (the us.* prefix), while still keeping everything off the public internet. The reason is that global.* routes globally and can show occasional backend latency swings; pinning to us.* (say us-west-2) gives steadier behavior. But a us.* call's entry point must land in a US region; you can't dispatch it from an APAC endpoint.

To reach us-west-2 Bedrock cross-region and stay fully private, set up a cross-region VPC Peering between the workload VPC (say Tokyo) and a us-west-2 VPC, put the Bedrock VPCE on the us-west-2 side, and let traffic cross privately over the peering.

L3 · cross-region US profile, fully private over VPC Peering
Tokyo VPC · 10.2.0.0/16 us-west-2 VPC · 10.1.0.0/16 cross-region VPC Peering · private ALB LiteLLM Podaws_region=us-west-2 Bedrock VPCEvpce-…us-west-2 (own hostname) Bedrockus.* · US Inference Profile
model_list · layer 3
  - model_name: claude-opus-4-8-us
    litellm_params:
      model: bedrock/us.anthropic.claude-opus-4-8
      aws_region_name: us-west-2
      aws_bedrock_runtime_endpoint: https://vpce-usw2-xxxxx.bedrock-runtime.us-west-2.vpce.amazonaws.com
      drop_params: true

Three things in the network layer have to be right:

Set expectations on latency: a trans-Pacific peering hop adds roughly 100–150ms over an in-region VPCE, mostly in time-to-first-token. Once the stream's connection is established, later tokens are barely affected. To keep flexibility, configure us.* as a fallback for global.*: run global normally and switch over only when needed.

L4
Layer 4 · Cross-Account+ cross-account

Many AWS accounts reach Bedrock, managed through one LiteLLM

If a customer has several AWS accounts that all need Bedrock but wants one gateway to manage them, issue keys, and bill centrally, the answer is cross-account AssumeRole: the gateway Pod first assumes a role in the target account, then calls that account's Bedrock with the temporary credentials. Each account's Bedrock bill stays separate, which keeps reconciliation clean. The config adds one line, aws_role_name, pointing at the target account's cross-account role.

model_list · layer 4
  - model_name: claude-sonnet-4-6-acct-b
    litellm_params:
      model: bedrock/global.anthropic.claude-sonnet-4-6
      aws_region_name: ap-northeast-1
      aws_role_name: arn:aws:iam::<ACCOUNT_B>:role/LiteLLM-Bedrock-CrossAccount-Role
      aws_session_name: bedrock-session
      aws_bedrock_runtime_endpoint: https://vpce-xxxxx.bedrock-runtime.ap-northeast-1.vpce.amazonaws.com
      drop_params: true
L4 · cross-account AssumeRole call chain
Workload account · EKS Account B · <ACCOUNT_B> sts:AssumeRole + TagSession InvokeModel LiteLLM PodPod Identity Pod Rolein-account IAM Cross-Account Roletrusts workload Pod Role Bedrockseparate billing per account

IAM has to be configured in both accounts. The workload account's Pod Role permission policy adds sts:AssumeRole on the target role, and its trust policy allows pods.eks.amazonaws.com to assume it (the standard EKS Pod Identity trust principal). The target account holds a cross-account role whose trust policy allows the workload account's Pod Role to assume it, with a permission policy granting Bedrock calls.

Easy to miss

Both policies must include sts:TagSession, paired with sts:AssumeRole. When EKS Pod Identity injects credentials, it attaches a set of transitive session tags, and those tags are carried along when the Pod assumes the cross-account role. If the target account's trust policy allows only sts:AssumeRole without sts:TagSession, the call returns AccessDenied outright.

If the customer also requires STS itself to be private (no public AssumeRole), just add an STS VPC Endpoint (with Private DNS) in the Pod's region. AssumeRole is initiated by the Pod locally, so the in-region STS VPCE is hit naturally, and there's no need to create one in the peer region.

+
Stacking & Special IDs

Stacking the layers, plus two special model IDs

The four layers are orthogonal and combine freely as needed: cross-region US profile plus cross-account plus fully private is just layers 2, 3, and 4 stacked, which is the architecture diagram above. Beyond global.* and us.*, two more model ID forms come up.

bare ID
Open-weight models on Bedrock

Models like GLM and Kimi don't support cross-region inference profiles, so conversely they must use a bare model ID with no global./us. prefix, called through the bedrock/converse/ path. As a consequence, the Pod's IAM policy must add a foundation-model ARN per model; the wildcard ARN for the Claude family doesn't cover them.

AIP
Application Inference Profile

For fine-grained cost attribution (e.g. AWS MAP credits), an AIP tags calls so they can be tracked, using local credentials. The constraint: an AIP can only wrap a base model that actually exists in a given region, not a global.* cross-region profile, so whether you can use one depends on whether the target region has that model's regional base model.

A few global settings worth reusing

litellm_settings, etc.
litellm_settings:
  drop_params: true        # drop params Bedrock rejects, avoid 400
  request_timeout: 600     # headroom for long inference
  num_retries: 2           # retry transient failures
  fallbacks:               # per-model downgrade chain on call failure
    - claude-opus-4-6: [claude-opus-4-5, claude-sonnet-4-5]
    - claude-sonnet-4-6: [claude-sonnet-4-5]
  context_window_fallbacks: # switch to a larger-window variant when over limit
    - claude-sonnet-4-5: [claude-4-5-sonnet-1M]

general_settings:
  store_model_in_db: true
  store_prompts_in_spend_logs: true

fallbacks and context_window_fallbacks are two different things: the former reroutes to a backup model when a call fails (throttling, errors) to preserve availability; the latter switches to a larger-window variant when the request's context exceeds the current model's limit (say from a 200K Sonnet to a 1M Sonnet). One handles "the call failed," the other handles "it doesn't fit."

Client Setup · Claude Code

Pointing local Claude Code at the gateway

Many customers' developers use Claude Code locally, and it connects to the first-party Anthropic API by default. Routing it through the self-hosted gateway takes two key steps: point it at the gateway URL, then feed in the virtual key with an apiKeyHelper script. Here's the easy trap: setting only an ANTHROPIC_AUTH_TOKEN in env often won't work. A static token is sent only as the Authorization header, but LiteLLM's virtual-key check reads x-api-key; an apiKeyHelper's output is sent as both the Authorization and X-Api-Key headers, which is what makes the setup actually authenticate. Claude Code talks to the /v1/messages (Anthropic format) entry point.

~/.claude/settings.json
{
  "apiKeyHelper": "~/.claude/litellm-key.sh",
  "env": {
    "ANTHROPIC_BASE_URL":            "https://<your LiteLLM gateway>",
    "ANTHROPIC_DEFAULT_OPUS_MODEL":   "claude-opus-4-8",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL":  "claude-haiku-4-5"
  }
}
~/.claude/litellm-key.sh · remember chmod +x
#!/bin/bash
# Simplest: just echo the virtual key
echo "<your LiteLLM virtual key>"

Each setting does one job. apiKeyHelper points to a script that prints the virtual key; Claude Code runs it at startup to get the key and sends it as both the Authorization and X-Api-Key headers, and without it a bare env token often fails to authenticate. ANTHROPIC_BASE_URL routes every request to the gateway ALB instead of straight to Anthropic, and the key is a virtual key, not AWS credentials, so the client never sees the underlying Bedrock permissions. The three ANTHROPIC_DEFAULT_*_MODEL values map Claude Code's built-in opus / sonnet / haiku tiers to the model_name entries in the LiteLLM model_list; they must match exactly, or the gateway can't find the model when you switch tiers.

Once it's set

In a session, /model sonnet and /model opus switch tiers instantly, with no client code changes. For rotating keys, change the script to pull the key from a vault and set a refresh interval with CLAUDE_CODE_API_KEY_HELPER_TTL_MS. To check the link works, send a single curl to the gateway's /v1/messages and confirm it returns 200.

Thinking Parameters

Thinking parameters split by model generation

Claude's extended thinking on Bedrock takes a parameter format that varies by model generation, and getting it wrong produces "looks like thinking is on but the model isn't thinking."

FormOpus 4.7 / 4.8Opus 4.6 / Sonnet 4.6Notes
thinking.type: adaptive✓ recommendedModel decides how much to think based on task complexity
output_config.effortlow/medium/high/xhigh/max; must live in output_config, not inside thinking or it's a ValidationException. xhigh is 4.7/4.8 only, and GA
thinking.type: enabled + budget_tokensdeprecatedstill worksbudget is deprecated; use adaptive on 4.7/4.8
A version-specific gotcha

On older LiteLLM versions, sending the deprecated {type: "enabled", budget_tokens: N} to Opus 4.7/4.8 returns 200 plus plain text with no thinking block, and no error. This is fixed in LiteLLM v1.88.1. To be safe, always use adaptive for 4.7/4.8.

One more detail on the response side: Opus 4.8/4.7 default to omitted summary mode, so the thinking block's text field is empty and the full reasoning is encrypted in the signature field for multi-turn continuation. A client reading an empty thinking text is normal; just pass the block back verbatim on the next turn.

Timeouts & the Load Balancer

Timeout alignment: the load balancer's default must go up

This is the most overlooked and most failure-prone spot after go-live. LiteLLM has its own request_timeout, but what cuts a conversation off first is usually not LiteLLM, it's the load balancer in front of it. Once a customer's conversation runs long (extended thinking, large output, multi-turn agents), tens of seconds is common, and the ALB's idle timeout defaults to just 60 seconds. If no new data flows for 60 seconds, the ALB drops the connection; the client sees a broken request or a 504, while LiteLLM is still waiting on the model perfectly normally. It's a confusing thing to debug, because LiteLLM's logs show no error.

The fix is to raise every timeout in the path to cover the longest request, and align them. Use an ingress annotation to take the ALB idle timeout from the default 60s up to 600s (or sized to your longest expected conversation), and set LiteLLM's request_timeout to the same value. Only when the two match is behavior predictable; otherwise the shortest layer always fires first, and you'll set 600 yet still get cut at 60. This is also one reason the gateway sits behind an ALB rather than CloudFront: CloudFront's origin response timeout tops out at 120 seconds, which can't cover long conversations.

If the gateway sits behind a self-hosted Nginx instead of an ALB, the same applies: Nginx's proxy_read_timeout / proxy_send_timeout also default to 60 seconds and must be raised together, or Nginx cuts first. Streaming doesn't get you around this, because idle timeout measures the gap between data chunks: if the model thinks for a long time before the first token (common with extended thinking), that silence can hit the idle timeout. So the wait before the first token is where you most need headroom.

LayerSettingDefaultSuggested
ALBidle_timeout.timeout_seconds (ingress annotation)60s600s
Nginx (self-hosted)proxy_read_timeout / proxy_send_timeout60s600s
LiteLLMrequest_timeout (config file)600s
Cost · Observability

Cost tracking and observability

Cost tracking

LiteLLM's spend log converts each request's cost using a built-in cost map. When a model is brand new, the cost map often hasn't picked it up yet and the spend log computes a cost of zero. The stopgap is to attach input_cost_per_token / output_cost_per_token on the model config as custom prices; once a LiteLLM update adds the pricing (v1.88.1 added Opus 4.8, for instance), the custom price can be removed, and leaving it in does no harm. For attributing cost to a specific project or claiming AWS MAP credits, tag calls with the Application Inference Profile mentioned earlier.

Observability

Install the CloudWatch Observability add-on on EKS, which ships two DaemonSets: the CloudWatch Agent collects Pod/node CPU, memory, and network (into Container Insights), and Fluent Bit forwards container stdout/stderr into CloudWatch Logs (30-day retention). On the LiteLLM side, two environment variables give structured logs: LITELLM_LOG=INFO (per request: model, routing decision, HTTP status, token usage) and LITELLM_DETAILED_TIMING=true (per-stage timing). For hard cases, switch temporarily to LITELLM_LOG=DEBUG to see full request/response bodies; DEBUG has a performance cost, so set it back to INFO when done. Logs land in the /aws/containerinsights/<cluster>/application log group, where Logs Insights is the easiest way to query.

Pre-Production Checklist

Before going to production

Each layer checks for different things: run through the items common to every deployment first, then add the ones for whichever layers the customer uses.

Common · check on every deployment

From L2 · when going private in-region

Add at L3 · cross-region private

Add at L4 · cross-account

The value here isn't "how to install LiteLLM" — there's official documentation for that. It's in breaking the network-isolation and account-boundary requirements into four progressive layers of config: the customer goes exactly as far as they need, and each layer maps cleanly to the LiteLLM parameters and AWS resources it touches.