Turn Bedrock into a production model gateway,
hardened layer by layer
Customers want a single OpenAI / Anthropic-compatible entry point backed by Amazon Bedrock, with central control over keys, cost, and rate limits, and without touching each vendor's SDK. LiteLLM sits exactly there. The hard part isn't installing LiteLLM. It's that as the customer's requirements for network isolation and account boundaries grow, the model config has to grow with them, one layer at a time. The four layers below: configure exactly as far as the customer needs.
Validate the minimal path: one Pod, public egress
Private networking, cross-region, and cross-account can all be introduced later. The fastest way to validate is to let one Pod with internet egress call Bedrock's public endpoint directly, get the full "client → gateway → Bedrock" path working, then harden it layer by layer. Three steps follow.
- Configure one model. Put a model ID and region in
model_list; let EKS Pod Identity inject credentials, with no access key written anywhere. - Point the client here. Have the app or Claude Code set its base URL to the gateway and authenticate with a LiteLLM virtual key.
- Raise the timeouts up front. Set the load balancer and LiteLLM timeouts to 600 seconds from the start, or long conversations get cut off mid-stream (the single most common failure, covered in its own section below).
model_list: - model_name: claude-sonnet-4-6 litellm_params: model: bedrock/global.anthropic.claude-sonnet-4-6 aws_region_name: ap-northeast-1 drop_params: true
Raise every timeout in the path to 600 seconds and align them. The ALB's idle timeout defaults to just 60 seconds, while long conversations (extended thinking, multi-turn agents) routinely run tens of seconds to minutes; once 60 seconds pass with no new data, the connection is dropped while LiteLLM is still waiting on the model perfectly normally, which is confusing to debug. See the Timeouts & LB section.
This minimal path is simply layer one of the four below. Once it's working, look at what LiteLLM does in this architecture and how the whole thing is structured, then return to the four layers to harden it.
LiteLLM's role in this architecture
Many people first hear of LiteLLM as an SDK you drop into code to call models. It does have an SDK form, but this architecture uses the other side of it: the Proxy Server, a standalone service process that exposes two standard HTTP interfaces, /v1/chat/completions (OpenAI format) and /v1/messages (Anthropic format), and internally translates requests into each vendor's API. Sitting between the customer and Bedrock, it solves four very concrete problems.
The customer's apps, Claude Code, and scripts all point at the same endpoint with the same key. Switching models, adding models, tuning routing all happen in the gateway config; the client never changes a line.
Bedrock IAM credentials and cross-account AssumeRole stay locked inside the gateway Pod. Clients only ever see a LiteLLM virtual key, never AWS credentials; revoking a customer is just deleting a key.
LiteLLM's built-in spend log records, per request, which model ran, how many tokens, and the dollar cost, all persisted to a database. Who uses the most and which model costs the most is right there.
Claude on Bedrock differs from the first-party API in call format, thinking parameters, and beta headers. The gateway absorbs those differences so clients just send the standard format.
The full picture
The diagram below shows all four config layers stacked together. The path comes in from the client, through the ALB to the LiteLLM Pods on EKS, then out to Bedrock along three different routes depending on the model, with account state persisted in Aurora.
A few design choices worth calling out. EKS over a single EC2 is about availability: if the gateway goes down, every customer's access drops at once, which is unacceptable. Two replicas roll over during updates, so upgrading the LiteLLM version causes zero downtime; a single EC2 is kept as a cold standby to take over on a full cluster failure. The ALB locks inbound IPs because what sits behind it is billable Bedrock traffic. An open 0.0.0.0/0 security group would expose a high-cost entry point to the internet, and a leaked virtual key would let anyone call it, so only the customer's known IPs are allowed. That's a hard rule.
A common follow-up: why an ALB out front, rather than a CloudFront layer like an ordinary web service? Two reasons. First, timeouts: CloudFront's origin response timeout defaults to 30 seconds and can be raised only to a 120-second ceiling (and that needs a quota request), whereas multi-turn agent and extended-thinking conversations easily run for minutes; CloudFront can't hold a connection that long. An ALB's idle timeout goes up to 4000 seconds, leaving headroom. Second, it's the wrong tool: CloudFront is built for edge-distributing cacheable content, but gateway traffic is all authenticated, all-distinct POST requests with nothing to cache, so adding it only buys an extra hop of latency and cost. A layer-7 load balancer (ALB) is the right fit in front of the gateway.
Pod Identity over IRSA: both inject AWS credentials into Pods, but Pod Identity is simpler to set up and natively supports attaching transitive session tags, a property that matters in the layer-4 cross-account case. Aurora PostgreSQL Serverless v2 backs store_model_in_db and the spend log, with both Pods sharing one record set; it scales with load and costs little when gateway traffic is light. The Pods are deliberately small (250m CPU / 1Gi memory, limits 500m / 2Gi): the LiteLLM proxy is IO-bound, so the bottleneck is network and concurrent connections, not CPU. The securityContext is tightened to least privilege, dropping all Linux capabilities and disabling privilege escalation.
Pod has internet egress, hit Bedrock's public endpoint
The simplest case: the Pod has internet egress. A model then needs only two things, a model ID and a region. Credentials are injected automatically by EKS Pod Identity (using the workload account's own permissions), so LiteLLM never configures an access key.
model_list: - model_name: claude-sonnet-4-6 litellm_params: model: bedrock/global.anthropic.claude-sonnet-4-6 aws_region_name: ap-northeast-1 drop_params: true
drop_params: true discards OpenAI parameters Bedrock doesn't support, avoiding a 400. Worth explaining the model ID forms here, since they are Bedrock's rules and getting them wrong fails outright: the global.* prefix in global.anthropic.claude-sonnet-4-6 is a cross-region inference profile, where Bedrock routes the request to whichever region has capacity. It has the best availability and is AWS's default recommendation. It must be called through an inference profile, never a bare base model ID.
At this layer the customer can already use it. But traffic rides Bedrock's public endpoint, and many customers' security and compliance won't clear that.
No internet egress, route through an in-region VPC Endpoint
To tighten security, the goal is a Pod with no internet access at all. The way to do it is to create a Bedrock VPC Endpoint (VPCE) in the Pod's region so traffic stays on AWS's private network the whole way. The config adds just one line, aws_bedrock_runtime_endpoint, pointing at that VPCE.
- model_name: claude-sonnet-4-6 litellm_params: model: bedrock/global.anthropic.claude-sonnet-4-6 aws_region_name: ap-northeast-1 aws_bedrock_runtime_endpoint: https://vpce-xxxxx.bedrock-runtime.ap-northeast-1.vpce.amazonaws.com drop_params: true
Three things to prepare on the AWS side: create the com.amazonaws.<region>.bedrock-runtime interface endpoint in the workload VPC with Private DNS enabled; allow inbound 443 from the Pod subnet on the VPCE security group; and remove the internet route from the Pod subnet (or simply place it in a private subnet) so Bedrock is reachable only through the VPCE.
At this layer the VPCE and the Pod live in the same VPC, so Private DNS works and the default service hostname would actually resolve to the VPCE too. Writing the VPCE hostname explicitly in the config mainly makes the traffic path obvious, and sets up the habit for the cross-VPC case in layer 3.
Reach a US Inference Profile cross-region, fully private
AWS recommends global.* by default, but some customers want the inference entry pinned to the US, using a US Inference Profile (the us.* prefix), while still keeping everything off the public internet. The reason is that global.* routes globally and can show occasional backend latency swings; pinning to us.* (say us-west-2) gives steadier behavior. But a us.* call's entry point must land in a US region; you can't dispatch it from an APAC endpoint.
To reach us-west-2 Bedrock cross-region and stay fully private, set up a cross-region VPC Peering between the workload VPC (say Tokyo) and a us-west-2 VPC, put the Bedrock VPCE on the us-west-2 side, and let traffic cross privately over the peering.
- model_name: claude-opus-4-8-us litellm_params: model: bedrock/us.anthropic.claude-opus-4-8 aws_region_name: us-west-2 aws_bedrock_runtime_endpoint: https://vpce-usw2-xxxxx.bedrock-runtime.us-west-2.vpce.amazonaws.com drop_params: true
Three things in the network layer have to be right:
- The endpoint must be the VPCE's own hostname (the one starting with
vpce-). A VPCE's Private DNS only takes effect inside the VPC that created it; it does not propagate across peering. The Tokyo Pod resolving the defaultbedrock-runtime.us-west-2.amazonaws.comwould get a public IP and never reach the VPCE. Only hard-coding this specific hostname resolves to the VPCE's private IP and routes correctly over the peering. aws_region_namemust match the VPCE's region (hereus-west-2). The AWS SDK signs requests for that region, and a mismatch fails the signature.- Route tables and security groups on both sides must be complete. On both the Tokyo and us-west-2 sides, the Pod subnet's route table needs a route to the peer CIDR with the peering connection as target; the us-west-2 Bedrock VPCE security group must allow inbound 443 from the Tokyo VPC CIDR.
Set expectations on latency: a trans-Pacific peering hop adds roughly 100–150ms over an in-region VPCE, mostly in time-to-first-token. Once the stream's connection is established, later tokens are barely affected. To keep flexibility, configure us.* as a fallback for global.*: run global normally and switch over only when needed.
Many AWS accounts reach Bedrock, managed through one LiteLLM
If a customer has several AWS accounts that all need Bedrock but wants one gateway to manage them, issue keys, and bill centrally, the answer is cross-account AssumeRole: the gateway Pod first assumes a role in the target account, then calls that account's Bedrock with the temporary credentials. Each account's Bedrock bill stays separate, which keeps reconciliation clean. The config adds one line, aws_role_name, pointing at the target account's cross-account role.
- model_name: claude-sonnet-4-6-acct-b litellm_params: model: bedrock/global.anthropic.claude-sonnet-4-6 aws_region_name: ap-northeast-1 aws_role_name: arn:aws:iam::<ACCOUNT_B>:role/LiteLLM-Bedrock-CrossAccount-Role aws_session_name: bedrock-session aws_bedrock_runtime_endpoint: https://vpce-xxxxx.bedrock-runtime.ap-northeast-1.vpce.amazonaws.com drop_params: true
IAM has to be configured in both accounts. The workload account's Pod Role permission policy adds sts:AssumeRole on the target role, and its trust policy allows pods.eks.amazonaws.com to assume it (the standard EKS Pod Identity trust principal). The target account holds a cross-account role whose trust policy allows the workload account's Pod Role to assume it, with a permission policy granting Bedrock calls.
Both policies must include sts:TagSession, paired with sts:AssumeRole. When EKS Pod Identity injects credentials, it attaches a set of transitive session tags, and those tags are carried along when the Pod assumes the cross-account role. If the target account's trust policy allows only sts:AssumeRole without sts:TagSession, the call returns AccessDenied outright.
If the customer also requires STS itself to be private (no public AssumeRole), just add an STS VPC Endpoint (with Private DNS) in the Pod's region. AssumeRole is initiated by the Pod locally, so the in-region STS VPCE is hit naturally, and there's no need to create one in the peer region.
Stacking the layers, plus two special model IDs
The four layers are orthogonal and combine freely as needed: cross-region US profile plus cross-account plus fully private is just layers 2, 3, and 4 stacked, which is the architecture diagram above. Beyond global.* and us.*, two more model ID forms come up.
Models like GLM and Kimi don't support cross-region inference profiles, so conversely they must use a bare model ID with no global./us. prefix, called through the bedrock/converse/ path. As a consequence, the Pod's IAM policy must add a foundation-model ARN per model; the wildcard ARN for the Claude family doesn't cover them.
For fine-grained cost attribution (e.g. AWS MAP credits), an AIP tags calls so they can be tracked, using local credentials. The constraint: an AIP can only wrap a base model that actually exists in a given region, not a global.* cross-region profile, so whether you can use one depends on whether the target region has that model's regional base model.
A few global settings worth reusing
litellm_settings: drop_params: true # drop params Bedrock rejects, avoid 400 request_timeout: 600 # headroom for long inference num_retries: 2 # retry transient failures fallbacks: # per-model downgrade chain on call failure - claude-opus-4-6: [claude-opus-4-5, claude-sonnet-4-5] - claude-sonnet-4-6: [claude-sonnet-4-5] context_window_fallbacks: # switch to a larger-window variant when over limit - claude-sonnet-4-5: [claude-4-5-sonnet-1M] general_settings: store_model_in_db: true store_prompts_in_spend_logs: true
fallbacks and context_window_fallbacks are two different things: the former reroutes to a backup model when a call fails (throttling, errors) to preserve availability; the latter switches to a larger-window variant when the request's context exceeds the current model's limit (say from a 200K Sonnet to a 1M Sonnet). One handles "the call failed," the other handles "it doesn't fit."
Pointing local Claude Code at the gateway
Many customers' developers use Claude Code locally, and it connects to the first-party Anthropic API by default. Routing it through the self-hosted gateway takes two key steps: point it at the gateway URL, then feed in the virtual key with an apiKeyHelper script. Here's the easy trap: setting only an ANTHROPIC_AUTH_TOKEN in env often won't work. A static token is sent only as the Authorization header, but LiteLLM's virtual-key check reads x-api-key; an apiKeyHelper's output is sent as both the Authorization and X-Api-Key headers, which is what makes the setup actually authenticate. Claude Code talks to the /v1/messages (Anthropic format) entry point.
{
"apiKeyHelper": "~/.claude/litellm-key.sh",
"env": {
"ANTHROPIC_BASE_URL": "https://<your LiteLLM gateway>",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-8",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "claude-haiku-4-5"
}
}
#!/bin/bash # Simplest: just echo the virtual key echo "<your LiteLLM virtual key>"
Each setting does one job. apiKeyHelper points to a script that prints the virtual key; Claude Code runs it at startup to get the key and sends it as both the Authorization and X-Api-Key headers, and without it a bare env token often fails to authenticate. ANTHROPIC_BASE_URL routes every request to the gateway ALB instead of straight to Anthropic, and the key is a virtual key, not AWS credentials, so the client never sees the underlying Bedrock permissions. The three ANTHROPIC_DEFAULT_*_MODEL values map Claude Code's built-in opus / sonnet / haiku tiers to the model_name entries in the LiteLLM model_list; they must match exactly, or the gateway can't find the model when you switch tiers.
In a session, /model sonnet and /model opus switch tiers instantly, with no client code changes. For rotating keys, change the script to pull the key from a vault and set a refresh interval with CLAUDE_CODE_API_KEY_HELPER_TTL_MS. To check the link works, send a single curl to the gateway's /v1/messages and confirm it returns 200.
Thinking parameters split by model generation
Claude's extended thinking on Bedrock takes a parameter format that varies by model generation, and getting it wrong produces "looks like thinking is on but the model isn't thinking."
| Form | Opus 4.7 / 4.8 | Opus 4.6 / Sonnet 4.6 | Notes |
|---|---|---|---|
| thinking.type: adaptive | ✓ recommended | ✓ | Model decides how much to think based on task complexity |
| output_config.effort | ✓ | ✓ | low/medium/high/xhigh/max; must live in output_config, not inside thinking or it's a ValidationException. xhigh is 4.7/4.8 only, and GA |
| thinking.type: enabled + budget_tokens | deprecated | still works | budget is deprecated; use adaptive on 4.7/4.8 |
On older LiteLLM versions, sending the deprecated {type: "enabled", budget_tokens: N} to Opus 4.7/4.8 returns 200 plus plain text with no thinking block, and no error. This is fixed in LiteLLM v1.88.1. To be safe, always use adaptive for 4.7/4.8.
One more detail on the response side: Opus 4.8/4.7 default to omitted summary mode, so the thinking block's text field is empty and the full reasoning is encrypted in the signature field for multi-turn continuation. A client reading an empty thinking text is normal; just pass the block back verbatim on the next turn.
Timeout alignment: the load balancer's default must go up
This is the most overlooked and most failure-prone spot after go-live. LiteLLM has its own request_timeout, but what cuts a conversation off first is usually not LiteLLM, it's the load balancer in front of it. Once a customer's conversation runs long (extended thinking, large output, multi-turn agents), tens of seconds is common, and the ALB's idle timeout defaults to just 60 seconds. If no new data flows for 60 seconds, the ALB drops the connection; the client sees a broken request or a 504, while LiteLLM is still waiting on the model perfectly normally. It's a confusing thing to debug, because LiteLLM's logs show no error.
The fix is to raise every timeout in the path to cover the longest request, and align them. Use an ingress annotation to take the ALB idle timeout from the default 60s up to 600s (or sized to your longest expected conversation), and set LiteLLM's request_timeout to the same value. Only when the two match is behavior predictable; otherwise the shortest layer always fires first, and you'll set 600 yet still get cut at 60. This is also one reason the gateway sits behind an ALB rather than CloudFront: CloudFront's origin response timeout tops out at 120 seconds, which can't cover long conversations.
If the gateway sits behind a self-hosted Nginx instead of an ALB, the same applies: Nginx's proxy_read_timeout / proxy_send_timeout also default to 60 seconds and must be raised together, or Nginx cuts first. Streaming doesn't get you around this, because idle timeout measures the gap between data chunks: if the model thinks for a long time before the first token (common with extended thinking), that silence can hit the idle timeout. So the wait before the first token is where you most need headroom.
| Layer | Setting | Default | Suggested |
|---|---|---|---|
| ALB | idle_timeout.timeout_seconds (ingress annotation) | 60s | 600s |
| Nginx (self-hosted) | proxy_read_timeout / proxy_send_timeout | 60s | 600s |
| LiteLLM | request_timeout (config file) | — | 600s |
Cost tracking and observability
Cost tracking
LiteLLM's spend log converts each request's cost using a built-in cost map. When a model is brand new, the cost map often hasn't picked it up yet and the spend log computes a cost of zero. The stopgap is to attach input_cost_per_token / output_cost_per_token on the model config as custom prices; once a LiteLLM update adds the pricing (v1.88.1 added Opus 4.8, for instance), the custom price can be removed, and leaving it in does no harm. For attributing cost to a specific project or claiming AWS MAP credits, tag calls with the Application Inference Profile mentioned earlier.
Observability
Install the CloudWatch Observability add-on on EKS, which ships two DaemonSets: the CloudWatch Agent collects Pod/node CPU, memory, and network (into Container Insights), and Fluent Bit forwards container stdout/stderr into CloudWatch Logs (30-day retention). On the LiteLLM side, two environment variables give structured logs: LITELLM_LOG=INFO (per request: model, routing decision, HTTP status, token usage) and LITELLM_DETAILED_TIMING=true (per-stage timing). For hard cases, switch temporarily to LITELLM_LOG=DEBUG to see full request/response bodies; DEBUG has a performance cost, so set it back to INFO when done. Logs land in the /aws/containerinsights/<cluster>/application log group, where Logs Insights is the easiest way to query.
Before going to production
Each layer checks for different things: run through the items common to every deployment first, then add the ones for whichever layers the customer uses.
Common · check on every deployment
- Never open
0.0.0.0/0inbound on the ALB security group; allow only the customer's known IPs. - Raise and align every timeout in the path: ALB / Nginx default to 60s and will cut long conversations; set them all to 600s.
- Clients only ever get a virtual key; AWS credentials stay locked in the gateway Pod, never handed to the client.
- Always use
thinking: adaptivefor Opus 4.7/4.8; don't send the deprecatedbudget_tokens. - For a primary model using server-side tools (e.g. web search), turn off
drop_params, or the tool definition gets stripped. - Use bare IDs through
bedrock/converse/for open-weight models, and add their ARNs one by one in IAM. - Before configuring AIP cost tracking, confirm the target region has the model's regional base model; an AIP can't wrap
global.*.
From L2 · when going private in-region
- Create an in-region Bedrock VPCE with Private DNS, and remove the internet route from the Pod subnet.
- Allow inbound 443 from the Pod subnet on the VPCE security group.
Add at L3 · cross-region private
- The endpoint must be the VPCE's own hostname; Private DNS does not propagate across VPCs, and the default hostname resolves to a public IP.
aws_region_namemust match the VPCE's region, or the SDK signature fails.- On both sides, add a route to the peer CIDR via the peering connection, and allow the peer VPC CIDR in the security group.
Add at L4 · cross-account
- Both IAM policies must carry
sts:TagSession, paired withsts:AssumeRole, or you get AccessDenied. - The target account's cross-account role trusts the workload account's Pod Role, with a permission policy granting Bedrock calls.
The value here isn't "how to install LiteLLM" — there's official documentation for that. It's in breaking the network-isolation and account-boundary requirements into four progressive layers of config: the customer goes exactly as far as they need, and each layer maps cleanly to the LiteLLM parameters and AWS resources it touches.