Serverless Architecture Lessons: Scaling to 99.8% Uptime
How we built a scalable AWS serverless platform that reduced infrastructure costs by 40% while maintaining high performance.
When I started building ReachGig's infrastructure, I made a bet on serverless architecture that many questioned. "Serverless isn't ready for production," they said. "You'll hit cold start problems," they warned. Eighteen months later, our AWS serverless platform achieved 99.8% uptime while reducing infrastructure costs by 40%. Here's what I learned about building production-grade serverless systems.
The Serverless Decision: Why We Chose AWS Lambda
The decision to go serverless wasn't driven by hype—it was driven by practical needs:
Resource Efficiency
As a startup, we needed to optimize for:
- Variable workloads: User activity fluctuated dramatically throughout the day
- Cost efficiency: Pay-per-use pricing aligned with our limited runway
- Development speed: Focus on product features, not infrastructure management
Scalability Requirements
We anticipated rapid growth and needed architecture that could:
- Scale automatically: Handle traffic spikes without manual intervention
- Scale to zero: Minimize costs during low-traffic periods
- Distribute globally: Serve users worldwide with low latency
Team Constraints
With a small engineering team, we needed:
- Minimal operational overhead: No server management or patching
- Fast deployment cycles: Rapid iteration and feature releases
- Built-in monitoring: Comprehensive observability without additional setup
Architecture Overview: Our Serverless Stack
Our final architecture combined multiple AWS services into a cohesive system:
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   CloudFront    │─────│   API Gateway    │─────│  Lambda Funcs   │
│  (Global CDN)   │     │  (API Routing)   │     │ (Business Logic)│
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                        ┌──────────────────┐              │
                        │     DynamoDB     │◄─────────────┘
                        │   (Primary DB)   │
                        └──────────────────┘
                                │
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│       S3        │     │   EventBridge    │─────│    SQS/SNS      │
│ (File Storage)  │     │ (Event Routing)  │     │   (Messaging)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
Core Components
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

// API Gateway + Lambda integration
export const apiHandler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  try {
    // Parse request
    const { httpMethod, path, body } = event;
    const requestData = body ? JSON.parse(body) : {};

    // Route to appropriate handler
    const result = await routeRequest(httpMethod, path, requestData);
    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*'
      },
      body: JSON.stringify(result)
    };
  } catch (error) {
    console.error('API Error:', error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: 'Internal server error' })
    };
  }
};
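The `routeRequest` helper isn't shown above; a minimal sketch of how such a dispatcher can work looks like the following (the route table and handler bodies are illustrative, not ReachGig's actual routes):

```typescript
// Hypothetical route table: maps "METHOD path" to a handler function
type Handler = (data: any) => Promise<any>;

const routes: Record<string, Handler> = {
  'GET /health': async () => ({ status: 'ok' }),
  'POST /users': async (data) => ({ created: true, user: data })
};

export async function routeRequest(
  method: string,
  path: string,
  data: any
): Promise<any> {
  const handler = routes[`${method} ${path}`];
  if (!handler) {
    // Surfaced as a 500 by the catch-all above; a real router would map this to 404
    throw new Error(`No route for ${method} ${path}`);
  }
  return handler(data);
}
```

Keeping routing in a plain lookup table keeps the dispatcher fast on warm invocations and easy to unit-test outside Lambda.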
import { DynamoDBClient, PutItemCommand, GetItemCommand } from '@aws-sdk/client-dynamodb';
import { marshall, unmarshall } from '@aws-sdk/util-dynamodb';

// DynamoDB operations
class DynamoDBService {
  private client: DynamoDBClient;

  constructor() {
    this.client = new DynamoDBClient({ region: 'us-east-1' });
  }

  async createItem(tableName: string, item: Record<string, any>): Promise<any> {
    const command = new PutItemCommand({
      TableName: tableName,
      Item: marshall(item),
      // Reject the write if an item with this id already exists
      ConditionExpression: 'attribute_not_exists(id)'
    });
    return await this.client.send(command);
  }

  async getItem(tableName: string, key: Record<string, any>): Promise<any> {
    const command = new GetItemCommand({
      TableName: tableName,
      Key: marshall(key)
    });
    const result = await this.client.send(command);
    return result.Item ? unmarshall(result.Item) : null;
  }
}
Lesson 1: Cold Starts Are Manageable
The biggest concern about serverless was cold start latency. Here's how we addressed it:
Provisioned Concurrency for Critical Functions
// CloudFormation resource for provisioned concurrency. It must be attached
// to a published version or alias; $LATEST cannot have provisioned concurrency.
const provisionedConcurrencyAlias = {
  Type: 'AWS::Lambda::Alias',
  Properties: {
    FunctionName: 'critical-api-function',
    FunctionVersion: '1',
    Name: 'live',
    ProvisionedConcurrencyConfig: {
      ProvisionedConcurrentExecutions: 10
    }
  }
};
Connection Pooling and Caching
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';
import { Agent } from 'https';

// Singleton pattern for database connections
class DatabaseConnection {
  private static instance: DynamoDBClient | null = null;

  static getInstance(): DynamoDBClient {
    if (!this.instance) {
      this.instance = new DynamoDBClient({
        region: process.env.AWS_REGION,
        maxAttempts: 3,
        // Reuse TCP connections across Lambda invocations
        requestHandler: new NodeHttpHandler({
          httpsAgent: new Agent({
            keepAlive: true,
            keepAliveMsecs: 1000,
            maxSockets: 10
          })
        })
      });
    }
    return this.instance;
  }
}
// In-memory caching for frequently accessed data
const cache = new Map<string, { value: any, expiry: number }>();

function getCachedData(key: string): any | null {
  const cached = cache.get(key);
  if (cached && Date.now() < cached.expiry) {
    return cached.value;
  }
  cache.delete(key);
  return null;
}

function setCachedData(key: string, value: any, ttlMs: number = 300000): void {
  cache.set(key, {
    value,
    expiry: Date.now() + ttlMs
  });
}
Optimization Results
- P99 cold start latency: Reduced from 2.5s to 400ms
- Warm response time: Consistently under 50ms
- Cache hit rate: 85% for frequently accessed data
Lesson 2: Event-Driven Architecture is Powerful
Serverless architecture naturally led us toward event-driven patterns:
Asynchronous Processing Pipeline
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

// EventBridge event publisher
class EventPublisher {
  private client: EventBridgeClient;

  constructor() {
    this.client = new EventBridgeClient({ region: 'us-east-1' });
  }

  async publishEvent(eventType: string, data: any): Promise<void> {
    const command = new PutEventsCommand({
      Entries: [{
        Source: 'reachgig.platform',
        DetailType: eventType,
        Detail: JSON.stringify(data),
        EventBusName: 'default'
      }]
    });
    await this.client.send(command);
  }
}
import { EventBridgeEvent } from 'aws-lambda';

// Event handler for user registration
export const handleUserRegistration = async (event: EventBridgeEvent<string, any>): Promise<void> => {
  // EventBridge delivers `detail` as an already-parsed object, not a JSON string
  const userData = event.detail;

  // Fan out independent tasks in parallel
  await Promise.all([
    sendWelcomeEmail(userData),
    createUserProfile(userData),
    initializeUserPreferences(userData),
    trackRegistrationMetrics(userData)
  ]);
};
import { SQSEvent, SQSBatchResponse } from 'aws-lambda';

// SQS batch processing with partial batch responses: only failed messages
// are retried, and repeated failures land in the dead-letter queue.
// (Requires ReportBatchItemFailures on the event source mapping; simply
// logging errors would cause failed messages to be deleted, not retried.)
export const processMessageQueue = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: { itemIdentifier: string }[] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const messageData = JSON.parse(record.body);
        await processMessage(messageData);
      } catch (error) {
        console.error('Message processing failed:', error);
        // Report the failure so SQS retries just this message
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    })
  );

  return { batchItemFailures };
};
Benefits of Event-Driven Architecture
- Loose coupling: Services communicate through events, not direct calls
- Scalability: Each service scales independently based on its event load
- Resilience: Failed events can be retried automatically
- Observability: Event flows provide clear audit trails
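The automatic retries mentioned above come from SQS redrive policies. In the same CloudFormation-as-objects style used earlier, a sketch of a queue/DLQ pair looks like this (queue names and counts are illustrative):

```typescript
// Hypothetical queue pair: after 3 failed receives, SQS moves the
// message to the dead-letter queue instead of retrying forever
const messageQueue = {
  Type: 'AWS::SQS::Queue',
  Properties: {
    QueueName: 'platform-events',
    VisibilityTimeout: 60, // seconds; should exceed the consuming Lambda's timeout
    RedrivePolicy: {
      deadLetterTargetArn: { 'Fn::GetAtt': ['PlatformEventsDLQ', 'Arn'] },
      maxReceiveCount: 3
    }
  }
};

const deadLetterQueue = {
  Type: 'AWS::SQS::Queue',
  Properties: {
    QueueName: 'platform-events-dlq',
    MessageRetentionPeriod: 1209600 // 14 days, the SQS maximum
  }
};
```

Alerting on DLQ depth then gives a clean signal for "events we failed to process," separate from transient errors that retries absorbed.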
Lesson 3: Cost Optimization Requires Strategy
Achieving 40% cost reduction required ongoing optimization:
Function-Level Cost Monitoring
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

// Custom CloudWatch metrics for cost tracking
class CostTracker {
  private cloudWatch: CloudWatchClient;

  constructor() {
    this.cloudWatch = new CloudWatchClient({ region: 'us-east-1' });
  }

  async trackFunctionCost(functionName: string, duration: number, memoryMB: number): Promise<void> {
    // Estimated cost from AWS pricing: $0.0000166667 per GB-second
    const costPerSecond = (memoryMB / 1024) * 0.0000166667;
    const estimatedCost = (duration / 1000) * costPerSecond;

    await this.cloudWatch.send(new PutMetricDataCommand({
      Namespace: 'ReachGig/Costs',
      MetricData: [{
        MetricName: 'EstimatedCost',
        Dimensions: [
          { Name: 'FunctionName', Value: functionName }
        ],
        Value: estimatedCost,
        Unit: 'None',
        Timestamp: new Date()
      }]
    }));
  }
}
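To make the pricing formula concrete, here is the same arithmetic as a pure function with a worked example (using the $0.0000166667 per GB-second rate quoted above; actual rates vary by region and exclude the per-request charge):

```typescript
// Estimated Lambda compute cost for a single invocation
export function estimateInvocationCost(durationMs: number, memoryMB: number): number {
  const PRICE_PER_GB_SECOND = 0.0000166667;
  const gbSeconds = (memoryMB / 1024) * (durationMs / 1000);
  return gbSeconds * PRICE_PER_GB_SECOND;
}

// Example: a 512 MB function running for 120 ms uses 0.06 GB-seconds,
// roughly $0.000001 per invocation, i.e. about $1 per million invocations
const perInvocation = estimateInvocationCost(120, 512);
const perMillion = perInvocation * 1_000_000;
```

Framing costs per million invocations makes it obvious where optimization effort pays off: halving memory on a high-traffic function matters; micro-optimizing a cron job does not.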
// Memory right-sizing: compare observed usage against the currently
// allocated size (the original thresholds, restated against allocation)
const optimizeMemoryAllocation = (functionMetrics: FunctionMetrics): number => {
  const { allocatedMB, p99MemoryUsed, avgDuration } = functionMetrics;

  if (p99MemoryUsed < allocatedMB * 0.7) {
    // Over-provisioned: shrink toward p99 usage with 20% headroom
    return Math.max(128, Math.ceil(p99MemoryUsed * 1.2));
  } else if (avgDuration > 5000 && allocatedMB > 512) {
    // Slow despite a mid-size allocation: more memory also means more CPU in Lambda
    return Math.min(3008, Math.ceil(allocatedMB * 1.5));
  }
  return allocatedMB;
};
Cost Optimization Results
- 40% reduction in overall infrastructure costs
- 60% reduction in database costs through on-demand billing
- 25% reduction in Lambda costs through memory optimization
- 50% reduction in storage costs through lifecycle policies
Results and Key Metrics
Our serverless architecture delivered impressive results:
Reliability Metrics
- 99.8% uptime over 18 months
- Mean Time to Recovery (MTTR): 4.2 minutes
- Mean Time Between Failures (MTBF): 720 hours
- Zero data loss incidents
Performance Metrics
- API response time: P50: 45ms, P95: 150ms, P99: 400ms
- Database query time: P95: 25ms
- End-to-end request processing: P95: 200ms
- Cold start impact: <2% of total requests
Cost Metrics
- 40% cost reduction compared to traditional infrastructure
- Pay-per-use scaling: Zero waste during low-traffic periods
- Operational cost reduction: 70% less time spent on infrastructure management
Key Takeaways for Serverless Success
Based on our experience, here are the critical factors for serverless success:
1. Design for Event-Driven Architecture
- Embrace asynchronous processing
- Use events to decouple services
- Implement proper error handling and retries
2. Optimize for Cold Starts
- Use provisioned concurrency for critical functions
- Implement connection pooling and caching
- Keep function bundles small and optimized
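On keeping bundles small: one common approach (not necessarily what ReachGig used) is bundling each handler with esbuild so it ships only the code it imports. A sketch of the options, with illustrative paths:

```typescript
// Hypothetical esbuild options: one bundled, minified file per handler entry point
const buildOptions = {
  entryPoints: ['src/handlers/apiHandler.ts'],
  bundle: true,
  minify: true,
  platform: 'node' as const,
  target: 'node18',
  external: ['@aws-sdk/*'], // the v3 SDK ships with the Node 18 Lambda runtime
  outdir: 'dist'
};
```

Smaller bundles download and initialize faster, which directly shrinks the cold-start numbers discussed in Lesson 1.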
3. Monitor Everything
- Implement comprehensive logging and tracing
- Set up meaningful alerts based on business metrics
- Use distributed tracing to understand request flows
4. Security First
- Apply least privilege principles to IAM roles
- Use AWS Secrets Manager for sensitive data
- Implement proper input validation and rate limiting
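For the input-validation point, a dependency-free sketch of the idea (a real service might reach for a schema library like zod; the field names here are illustrative):

```typescript
// Minimal request validator: check required fields and basic shape
// before any business logic or database call runs
interface ValidationResult {
  valid: boolean;
  errors: string[];
}

export function validateRegistration(data: any): ValidationResult {
  const errors: string[] = [];
  if (typeof data?.email !== 'string' || !data.email.includes('@')) {
    errors.push('email must be a valid address');
  }
  if (typeof data?.name !== 'string' || data.name.length === 0) {
    errors.push('name is required');
  }
  return { valid: errors.length === 0, errors };
}
```

Rejecting bad input with a 400 before touching DynamoDB keeps malformed requests from consuming capacity or poisoning downstream events.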
5. Cost Optimization is Ongoing
- Monitor function-level costs and usage
- Right-size memory allocations based on usage patterns
- Use appropriate storage classes and lifecycle policies
Conclusion
Building a production-grade serverless platform taught me that serverless isn't just about eliminating servers—it's about fundamentally rethinking how we architect, deploy, and operate applications. The 99.8% uptime we achieved wasn't just a result of AWS's reliable infrastructure, but of embracing serverless-native patterns and best practices.
The 40% cost reduction came not just from pay-per-use pricing, but from architectural decisions that eliminated waste and optimized resource utilization. Most importantly, the operational simplicity allowed our small team to focus on building features that users cared about, rather than managing infrastructure.
For teams considering serverless architecture, my advice is to start small, measure everything, and gradually build expertise. The learning curve is real, but the benefits—in terms of scalability, cost efficiency, and developer productivity—make it worthwhile.
Serverless isn't the right choice for every use case, but for applications with variable workloads, small teams, and rapid iteration requirements, it can be transformative. The key is understanding its strengths and designing your architecture to leverage them effectively.
Harshavardhan is a Founding Engineer at Mantys Healthcare AI. Previously, he was Founder and CTO of ReachGig, where he built production serverless systems serving thousands of users. Connect with him on LinkedIn for discussions about serverless architecture, AWS, and scaling startups.