December 19, 2025
When building production applications that require multiple AI models, you face a critical architectural decision: how to manage model routing, fallbacks, and provider abstraction without creating technical debt.
This guide examines two approaches—AWS Bedrock's fully managed solution and LiteLLM's open-source gateway—and provides a complete production deployment for LiteLLM when it's the right choice.
Your application needs different models for different tasks: Claude Sonnet for complex reasoning, GPT-4 for specific capabilities, Haiku for high-throughput simple tasks. Managing multiple provider SDKs creates several problems:
import json

import boto3
import openai
import anthropic

# OpenAI
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Anthropic (different client, different required fields)
anthropic_client = anthropic.Anthropic()
response = anthropic_client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

# AWS Bedrock (entirely different SDK and payload format)
bedrock = boto3.client("bedrock-runtime")
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Hello"}]
    })
)
Operational complexity: Three billing systems, three sets of credentials, inconsistent error handling, and code changes required to switch models.
Cost visibility: Without unified tracking, understanding which models drive costs becomes difficult.
AWS Bedrock should be your default choice for multi-model AI deployments. It provides native AWS integration with several advantages:
Fully managed infrastructure: No containers, no database maintenance, no gateway upgrades. AWS handles availability, scaling, and security patches.
Available models: Claude (Anthropic), Llama (Meta), Mistral, Cohere, Titan (Amazon), Jurassic (AI21), and others—all accessible through a single API.
Unified billing and cost allocation: Native AWS Cost Explorer integration, resource tagging for cost attribution, and consolidated billing across all models.
Security and compliance: VPC endpoints for private connectivity, AWS IAM for access control, encryption at rest and in transit, and compliance certifications (SOC, HIPAA, GDPR).
Built-in capabilities: Model evaluation tools, knowledge bases, agents framework, and guardrails for content filtering.
Model availability: Not all providers are available. If you need models outside Bedrock's catalog (newer GPT versions, specialized models, or providers not yet integrated), Bedrock isn't an option.
Cost structure: Bedrock pricing carries a markup over direct provider pricing. For high-volume applications, that markup can add up to a significant premium.
Flexibility constraints: Limited control over caching strategies, routing logic, and fallback behavior compared to self-managed solutions.
Regional availability: Some models are only available in specific AWS regions, potentially increasing latency for global deployments.
Bedrock is the right choice when:
• All required models are available in Bedrock
• You prioritize operational simplicity over cost optimization
• AWS-native compliance and security features are requirements
• Your organization standardizes on AWS managed services
LiteLLM is appropriate when:
1. Required models aren't in Bedrock: You need OpenAI GPT-4, specific versions, or providers not yet available in Bedrock
2. Cost optimization is critical: Direct provider pricing without AWS markup matters for high-volume workloads
3. Advanced routing logic: You need sophisticated fallback chains, custom load balancing, or A/B testing between models
4. Caching requirements: Aggressive caching strategies to reduce costs and latency beyond what Bedrock offers
LiteLLM provides a unified OpenAI-compatible API that translates requests to any provider. It's a proven open-source solution (used in production by many organizations) that runs on your infrastructure.
[Application] → [ALB] → [ECS Tasks] → [AI Providers]
                             ↓
        [RDS PostgreSQL]    ← Logs, config, analytics
        [ElastiCache Redis] ← Caching, rate limiting
        [Secrets Manager]   ← API keys
import requests

# Same request format for any provider
response = requests.post(
    "https://your-gateway.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "claude-3-sonnet",  # or "gpt-4", or "bedrock-claude-sonnet"
        "messages": [{"role": "user", "content": "Hello"}]
    }
)
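Because every model is reached through the same endpoint and payload shape, call sites can share one small helper. This is a sketch; the gateway URL and API key are placeholders from this guide, not real values:

```python
def build_chat_request(model, user_content,
                       base_url="https://your-gateway.com",
                       api_key="sk-litellm-your-master-key"):
    """Build the (url, headers, payload) triple for a gateway chat call.

    The same shape works for any model name registered on the gateway,
    so switching providers is a one-string change at the call site.
    """
    url = f"{base_url}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return url, headers, payload

# Swapping providers changes only the model string:
url, headers, payload = build_chat_request("claude-3-sonnet", "Hello")
url, headers, payload = build_chat_request("gpt-4", "Hello")
```

Passing the returned triple to `requests.post(url, headers=headers, json=payload)` then behaves identically regardless of which provider ultimately serves the request.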
• Request/response caching to reduce costs
• Automatic fallbacks when primary models fail
• Load balancing across multiple API keys
• Detailed cost tracking and budget alerts
• Usage analytics per user, team, or feature
Infrastructure cost: Approximately $80-150/month for moderate usage (1-5M requests), scaling with traffic.
This deployment is production-ready with proper security, monitoring, and scalability. Budget approximately 60-90 minutes for complete setup, including testing.
• AWS Account with appropriate IAM permissions
• AWS CLI configured: aws configure
• Docker installed locally
• Domain name (optional, can use ALB DNS)
• API keys for required providers
LiteLLM requires a database for configuration, request logs, and analytics.
aws ec2 create-security-group \
  --group-name litellm-rds-sg \
  --description "LiteLLM RDS security group" \
  --vpc-id vpc-12345678
aws rds create-db-instance \
  --db-instance-identifier litellm-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 15.4 \
  --master-username litellm \
  --master-user-password 'YourSecurePassword123!' \
  --allocated-storage 20 \
  --vpc-security-group-ids sg-rds12345678 \
  --db-subnet-group-name your-db-subnet-group \
  --backup-retention-period 7 \
  --no-publicly-accessible \
  --storage-encrypted
aws rds wait db-instance-available --db-instance-identifier litellm-db
aws rds describe-db-instances \
  --db-instance-identifier litellm-db \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text
Redis handles request caching and distributed rate limiting across ECS tasks.
aws ec2 create-security-group \
  --group-name litellm-redis-sg \
  --description "LiteLLM Redis security group" \
  --vpc-id vpc-12345678
aws elasticache create-cache-cluster \
  --cache-cluster-id litellm-redis \
  --cache-node-type cache.t3.micro \
  --engine redis \
  --num-cache-nodes 1 \
  --security-group-ids sg-redis12345678 \
  --cache-subnet-group-name your-cache-subnet-group
aws elasticache wait cache-cluster-available --cache-cluster-id litellm-redis
aws elasticache describe-cache-clusters \
  --cache-cluster-id litellm-redis \
  --show-cache-node-info \
  --query 'CacheClusters[0].CacheNodes[0].Endpoint.Address' \
  --output text
Store all sensitive credentials in AWS Secrets Manager.
# OpenAI
aws secretsmanager create-secret \
  --name litellm/openai-api-key \
  --secret-string "sk-proj-your-openai-key"

# Anthropic
aws secretsmanager create-secret \
  --name litellm/anthropic-api-key \
  --secret-string "sk-ant-your-anthropic-key"

# Master key for client authentication
aws secretsmanager create-secret \
  --name litellm/master-key \
  --secret-string "sk-litellm-$(openssl rand -hex 32)"

# Database password (single quotes avoid shell history expansion of '!')
aws secretsmanager create-secret \
  --name litellm/db-password \
  --secret-string 'YourSecurePassword123!'
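The `openssl rand -hex 32` command above produces a 64-character hex string. If your provisioning happens in Python rather than shell, the standard library equivalent (a sketch, using the same `sk-litellm-` prefix convention as this guide) is:

```python
import secrets

# Python equivalent of: sk-litellm-$(openssl rand -hex 32)
# token_hex(32) yields 64 hex characters from a cryptographic RNG.
master_key = "sk-litellm-" + secrets.token_hex(32)
print(len(master_key))  # → 75 (11-char prefix + 64 hex chars)
```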
Create litellm-config.yaml defining models, routing, and policies:
model_list:
  # OpenAI models (not available in Bedrock)
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic direct (better pricing than Bedrock)
  - model_name: claude-3-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-3-sonnet
    litellm_params:
      model: claude-3-sonnet-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-3-haiku
    litellm_params:
      model: claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

  # AWS Bedrock (for models where the managed service is preferred)
  - model_name: bedrock-claude-sonnet
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_region_name: us-east-1
  - model_name: bedrock-claude-haiku
    litellm_params:
      model: bedrock/anthropic.claude-3-haiku-20240307-v1:0
      aws_region_name: us-east-1

# Routing and resilience
litellm_settings:
  # Automatic fallbacks
  fallbacks:
    - model: claude-3-sonnet
      fallback: [gpt-4, claude-3-opus]
    - model: gpt-4
      fallback: [claude-3-opus, bedrock-claude-sonnet]

  # Caching configuration
  cache:
    type: redis
    host: ${REDIS_HOST}
    port: 6379
    ttl: 3600  # 1 hour cache

  # Rate limiting
  max_requests_per_minute: 100

  # Budget controls
  max_budget: 1000  # $1000/month
  budget_duration: 30d

# Database and authentication
general_settings:
  database_url: postgresql://litellm:${DB_PASSWORD}@${DB_HOST}:5432/litellm
  master_key: ${LITELLM_MASTER_KEY}

  # Logging configuration
  log_requests: true
  log_response_body: false  # Privacy: don't log full responses
This configuration will be mounted into ECS tasks.
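A long model_list makes copy-paste mistakes easy, and repeated model_name entries change routing behavior (LiteLLM can treat repeated names as a load-balancing group, so an accidental verbatim duplicate is worth surfacing before deploy rather than discovering later). A small pre-deploy check, sketched here against the config already parsed into a dict (e.g. with PyYAML), flags repeats for review:

```python
def find_repeated_model_names(config):
    """Return model_name values that appear more than once in model_list.

    Repeats may be intentional (load balancing across deployments) or an
    accidental duplicate; either way they deserve a look before deploy.
    """
    seen, repeats = set(), []
    for entry in config.get("model_list", []):
        name = entry.get("model_name")
        if name in seen and name not in repeats:
            repeats.append(name)
        seen.add(name)
    return repeats

# Example with a deliberately duplicated alias:
config = {
    "model_list": [
        {"model_name": "claude-3-sonnet", "litellm_params": {}},
        {"model_name": "claude-3-sonnet", "litellm_params": {}},
        {"model_name": "gpt-4", "litellm_params": {}},
    ]
}
print(find_repeated_model_names(config))  # → ['claude-3-sonnet']
```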
aws ecs create-cluster --cluster-name litellm-cluster
aws ec2 create-security-group \
  --group-name litellm-ecs-sg \
  --description "LiteLLM ECS tasks" \
  --vpc-id vpc-12345678
Save as ecs-trust-policy.json:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "ecs-tasks.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
Create role and attach policies:
aws iam create-role \
  --role-name litellm-ecs-execution-role \
  --assume-role-policy-document file://ecs-trust-policy.json

aws iam attach-role-policy \
  --role-name litellm-ecs-execution-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

aws iam attach-role-policy \
  --role-name litellm-ecs-execution-role \
  --policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite

aws iam create-role \
  --role-name litellm-ecs-task-role \
  --assume-role-policy-document file://ecs-trust-policy.json

aws iam attach-role-policy \
  --role-name litellm-ecs-task-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
Save as litellm-task-definition.json:
{
  "family": "litellm-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/litellm-ecs-execution-role",
  "taskRoleArn": "arn:aws:iam::123456789012:role/litellm-ecs-task-role",
  "containerDefinitions": [{
    "name": "litellm",
    "image": "ghcr.io/berriai/litellm:main-latest",
    "portMappings": [{
      "containerPort": 4000,
      "protocol": "tcp"
    }],
    "environment": [
      {"name": "DB_HOST", "value": "litellm-db.xxxxxxxxx.us-east-1.rds.amazonaws.com"},
      {"name": "REDIS_HOST", "value": "litellm-redis.xxxxxx.0001.use1.cache.amazonaws.com"},
      {"name": "AWS_REGION_NAME", "value": "us-east-1"}
    ],
    "secrets": [
      {"name": "OPENAI_API_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:litellm/openai-api-key"},
      {"name": "ANTHROPIC_API_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:litellm/anthropic-api-key"},
      {"name": "LITELLM_MASTER_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:litellm/master-key"},
      {"name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:litellm/db-password"}
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/litellm",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "command": ["--config", "/app/config.yaml", "--port", "4000"]
  }]
}
Register task definition:
aws ecs register-task-definition --cli-input-json file://litellm-task-definition.json
aws elbv2 create-load-balancer \
  --name litellm-alb \
  --subnets subnet-12345678 subnet-87654321 \
  --security-groups sg-alb12345678 \
  --scheme internet-facing \
  --type application
aws elbv2 create-target-group \
  --name litellm-tg \
  --protocol HTTP \
  --port 4000 \
  --vpc-id vpc-12345678 \
  --target-type ip \
  --health-check-path /health \
  --health-check-interval-seconds 30
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/litellm-alb/xxxxx \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=arn:aws:acm:us-east-1:123456789012:certificate/xxxxx \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/litellm-tg/xxxxx
aws ecs create-service \
  --cluster litellm-cluster \
  --service-name litellm-service \
  --task-definition litellm-task \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-12345678,subnet-87654321],securityGroups=[sg-ecs12345678],assignPublicIp=DISABLED}" \
  --load-balancers targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/litellm-tg/xxxxx,containerName=litellm,containerPort=4000
aws ecs wait services-stable --cluster litellm-cluster --services litellm-service
aws elbv2 describe-load-balancers \
  --names litellm-alb \
  --query 'LoadBalancers[0].DNSName' \
  --output text
Production deployments need auto-scaling to handle traffic variability:
# Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/litellm-cluster/litellm-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10
# CPU-based scaling policy
aws application-autoscaling put-scaling-policy \
  --policy-name litellm-cpu-scaling \
  --service-namespace ecs \
  --resource-id service/litellm-cluster/litellm-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
curl https://your-gateway.com/health
# Expected: {"status": "healthy"}
curl -X POST https://your-gateway.com/v1/chat/completions \
  -H "Authorization: Bearer sk-litellm-your-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-haiku",
    "messages": [{"role": "user", "content": "Say hello in 5 words"}]
  }'
curl -X POST https://your-gateway.com/v1/chat/completions \
  -H "Authorization: Bearer sk-litellm-your-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": ["claude-3-sonnet", "bedrock-claude-sonnet"]
  }'
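The fallback behavior exercised by the request above amounts to a try-in-order loop. This sketch mirrors that logic on the client side, with stub callables standing in for real provider calls (the stub names and responses are illustrative only):

```python
def complete_with_fallbacks(primary, fallbacks, callers):
    """Try the primary model, then each fallback in order.

    `callers` maps a model name to a callable that returns a response
    or raises on provider error (stand-ins for real API calls).
    """
    last_error = None
    for model in [primary, *fallbacks]:
        try:
            return model, callers[model]()
        except Exception as exc:  # provider error: move on to the next model
            last_error = exc
    raise RuntimeError(f"All models failed: {last_error}")

def _rate_limited():
    raise ConnectionError("rate limited")

# Stubs: gpt-4 is "down", claude-3-sonnet answers.
callers = {
    "gpt-4": _rate_limited,
    "claude-3-sonnet": lambda: "Hello!",
    "bedrock-claude-sonnet": lambda: "Hello from Bedrock",
}
model, reply = complete_with_fallbacks(
    "gpt-4", ["claude-3-sonnet", "bedrock-claude-sonnet"], callers
)
print(model, reply)  # → claude-3-sonnet Hello!
```

Running this loop inside the gateway rather than in every client is precisely what keeps application code free of per-provider error handling.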
CloudWatch metrics to track:
# ECS CPU utilization (service metrics require both ClusterName and ServiceName dimensions)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=litellm-cluster Name=ServiceName,Value=litellm-service \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T23:59:59Z \
  --period 3600 \
  --statistics Average
-- Cost per model (last 7 days)
SELECT
model,
SUM(cost) as total_cost,
COUNT(*) as request_count,
AVG(cost) as avg_cost_per_request
FROM request_logs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY total_cost DESC;
-- Error rates by provider
SELECT
model,
status_code,
COUNT(*) as error_count
FROM request_logs
WHERE status_code >= 400
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY model, status_code
ORDER BY error_count DESC;
Slow first requests after idle periods:
• Cause: ECS cold starts
• Solution: Maintain a minimum of 2 tasks, enable ALB connection draining
Rate limit errors on the primary model:
• Cause: All requests try the primary model first
• Solution: Implement client-side load balancing based on rate limit status
Low cache hit rate:
• Cause: Prompt variations preventing cache hits
• Solution: Normalize prompts (trim whitespace, normalize formatting) before caching
Rapid database growth:
• Cause: Logging full request/response bodies
• Solution: Set log_response_body: false, implement PostgreSQL log rotation
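The prompt-normalization fix is easy to apply on the client side before requests reach the cache. Here is a sketch of one possible scheme, whitespace collapsing plus a hash-based key; the key format is an illustrative assumption, not LiteLLM's internal cache key:

```python
import hashlib
import json

def normalize_prompt(text):
    """Collapse runs of whitespace so trivially different prompts cache-hit."""
    return " ".join(text.split())

def cache_key(model, messages):
    """Deterministic key over the model plus normalized message contents."""
    normalized = [
        {"role": m["role"], "content": normalize_prompt(m["content"])}
        for m in messages
    ]
    blob = json.dumps({"model": model, "messages": normalized}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

# Two formattings of the same prompt produce the same key:
a = cache_key("claude-3-haiku", [{"role": "user", "content": "Say   hello\n"}])
b = cache_key("claude-3-haiku", [{"role": "user", "content": "Say hello"}])
print(a == b)  # → True
```

Whether whitespace collapsing is safe depends on your prompts (it would be wrong for code or whitespace-sensitive content), so scope any normalization to request paths where formatting genuinely doesn't matter.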
Use this framework to choose the right approach.
Choose Bedrock when:
• All required models are available in Bedrock
• Operational simplicity is the priority
• AWS-native compliance features are required
• Organization standardizes on AWS managed services
• Team has limited DevOps resources
Choose LiteLLM when:
• Required models aren't available in Bedrock (OpenAI GPT-4, specific providers)
• Direct provider pricing matters for cost optimization
• Advanced routing logic is needed (complex fallbacks, A/B testing)
• Aggressive caching requirements to reduce costs
• Need detailed analytics and cost tracking beyond AWS Cost Explorer
Many production systems use both:
• Primary: AWS Bedrock for available models (managed operations)
• Secondary: LiteLLM for models not in Bedrock (flexibility)
• Routing: Application-level logic determines which gateway to use
AWS Bedrock should be your default choice for multi-model deployments when it meets your requirements. It eliminates operational complexity and provides native AWS integration.
LiteLLM becomes the right choice when you need models outside Bedrock's catalog, require cost optimization at scale, or need advanced routing capabilities. The deployment outlined here is production-ready with proper security, monitoring, and scalability.
Both approaches solve the same core problem—abstracting away provider-specific APIs—but make different trade-offs between operational simplicity and flexibility. Choose based on your specific requirements, and remember that hybrid approaches combining both solutions are often the most pragmatic for production systems.