Serverless Architecture: The Hard Lessons
Everyone told us not to use serverless for a production startup. We did it anyway. Here is what broke, and how we fixed it.
When I started building ReachGig, I made a bet on serverless architecture that many people told me was stupid.
"Serverless is for toy apps," they said. "You'll hit cold starts." "Vendor lock-in will kill you."
They weren't entirely wrong. But neither were we.
Eighteen months later, our AWS serverless stack was handling thousands of users with 99.8% uptime, and our infrastructure bill was a fraction of what a traditional EC2 cluster would have cost. But getting there wasn't a straight line. We hit walls. APIs that were supposed to respond in milliseconds took 3 seconds to load. We had debugging sessions that made me question my career choices.
Here is the honest breakdown of how we built a production-grade serverless platform, including the stuff that sucked and how we fixed it.
Why We Took the Risk
The decision wasn't about hype. It was about survival.
- I was the only DevOps engineer. I didn't have time to patch servers or manage Kubernetes clusters. I needed to write feature code.
- We were broke. Paying for idle EC2 instances at 3 AM when nobody was using the app was not an option.
- Traffic was spiky. We needed to scale from 0 to 1,000 and back to 0 in minutes without me waking up to adjust an auto-scaling group.
My 'Frankenstein' Stack
We didn't just use Lambda. We used the whole AWS toy box.
Our final architecture combined multiple AWS services into a cohesive system:
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   CloudFront    │────▶│   API Gateway    │────▶│  Lambda Funcs   │
│  (Global CDN)   │     │  (API Routing)   │     │ (Business Logic)│
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                         │
                        ┌──────────────────┐             │
                        │     DynamoDB     │◄────────────┘
                        │   (Primary DB)   │
                        └──────────────────┘

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│       S3        │     │   EventBridge    │────▶│     SQS/SNS     │
│ (File Storage)  │     │  (Event Routing) │     │   (Messaging)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
Core Components
// API Gateway + Lambda integration
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const apiHandler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  try {
    // Parse request
    const { httpMethod, path, body } = event;
    const requestData = body ? JSON.parse(body) : {};

    // Route to appropriate handler
    const result = await routeRequest(httpMethod, path, requestData);

    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*'
      },
      body: JSON.stringify(result)
    };
  } catch (error) {
    console.error('API Error:', error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: 'Internal server error' })
    };
  }
};
// DynamoDB operations
import { DynamoDBClient, PutItemCommand, GetItemCommand } from '@aws-sdk/client-dynamodb';
import { marshall, unmarshall } from '@aws-sdk/util-dynamodb';

class DynamoDBService {
  private client: DynamoDBClient;

  constructor() {
    this.client = new DynamoDBClient({ region: 'us-east-1' });
  }

  async createItem(tableName: string, item: any): Promise<any> {
    const command = new PutItemCommand({
      TableName: tableName,
      Item: marshall(item),
      // Reject the write if an item with this id already exists
      ConditionExpression: 'attribute_not_exists(id)'
    });
    return await this.client.send(command);
  }

  async getItem(tableName: string, key: any): Promise<any> {
    const command = new GetItemCommand({
      TableName: tableName,
      Key: marshall(key)
    });
    const result = await this.client.send(command);
    return result.Item ? unmarshall(result.Item) : null;
  }
}
Lesson 1: Cold Starts Are Real (But Solvable)
The biggest concern about serverless was cold start latency. Here's how we addressed it:
Provisioned Concurrency for Critical Functions
// CloudFormation resource for provisioned concurrency.
// Note: provisioned concurrency cannot target $LATEST; the qualifier
// must be a published version or an alias (here, a 'live' alias).
const provisionedConcurrencyConfig = {
  Type: 'AWS::Lambda::ProvisionedConcurrencyConfig',
  Properties: {
    FunctionName: 'critical-api-function',
    Qualifier: 'live',
    ProvisionedConcurrentExecutions: 10
  }
};
Connection Pooling and Caching
// Singleton pattern for database connections
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';
import { Agent } from 'https';

class DatabaseConnection {
  private static instance: DynamoDBClient | null = null;

  static getInstance(): DynamoDBClient {
    if (!this.instance) {
      this.instance = new DynamoDBClient({
        region: process.env.AWS_REGION,
        maxAttempts: 3,
        // Reuse connections across Lambda invocations
        requestHandler: new NodeHttpHandler({
          httpsAgent: new Agent({
            keepAlive: true,
            keepAliveMsecs: 1000,
            maxSockets: 10
          })
        })
      });
    }
    return this.instance;
  }
}

// In-memory caching for frequently accessed data
const cache = new Map<string, { value: any, expiry: number }>();

function getCachedData(key: string): any | null {
  const cached = cache.get(key);
  if (cached && Date.now() < cached.expiry) {
    return cached.value;
  }
  cache.delete(key);
  return null;
}

function setCachedData(key: string, value: any, ttlMs: number = 300000): void {
  cache.set(key, {
    value,
    expiry: Date.now() + ttlMs
  });
}
Optimization Results
- P99 cold start latency: Reduced from 2.5s to 400ms
- Warm response time: Consistently under 50ms
- Cache hit rate: 85% for frequently accessed data
Lesson 2: Thinking in Events (The Async Mindset)
Serverless architecture naturally led us toward event-driven patterns:
Asynchronous Processing Pipeline
// EventBridge event publisher
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';
import { EventBridgeEvent } from 'aws-lambda';

class EventPublisher {
  private client: EventBridgeClient;

  constructor() {
    this.client = new EventBridgeClient({ region: 'us-east-1' });
  }

  async publishEvent(eventType: string, data: any): Promise<void> {
    const command = new PutEventsCommand({
      Entries: [{
        Source: 'reachgig.platform',
        DetailType: eventType,
        Detail: JSON.stringify(data),
        EventBusName: 'default'
      }]
    });
    await this.client.send(command);
  }
}

// Event handler for user registration.
// By the time the event reaches Lambda, `detail` is already a parsed
// object, so there is no need to JSON.parse it again.
export const handleUserRegistration = async (event: EventBridgeEvent<string, any>): Promise<void> => {
  const userData = event.detail;

  // Parallel processing using Promise.all
  await Promise.all([
    sendWelcomeEmail(userData),
    createUserProfile(userData),
    initializeUserPreferences(userData),
    trackRegistrationMetrics(userData)
  ]);
};
// SQS for reliable message processing.
// With partial batch responses enabled on the event source mapping
// (ReportBatchItemFailures), only the failed messages are redelivered
// and eventually routed to the dead-letter queue. Swallowing errors
// with allSettled alone would mark the whole batch as successful.
import { SQSEvent, SQSBatchResponse } from 'aws-lambda';

export const processMessageQueue = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: { itemIdentifier: string }[] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const messageData = JSON.parse(record.body);
        await processMessage(messageData);
      } catch (error) {
        console.error('Message processing failed:', error);
        // Report the failure so SQS redelivers only this message
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    })
  );

  return { batchItemFailures };
};
Benefits of Event-Driven Architecture
- Loose coupling: Services communicate through events, not direct calls
- Scalability: Each service scales independently based on its event load
- Resilience: Failed events can be retried automatically
- Observability: Event flows provide clear audit trails
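The loose-coupling point is easiest to see in miniature. The sketch below is an in-memory stand-in for what EventBridge does for us in production: publishers and subscribers share only an event name, never a direct reference to each other (names like `MiniEventBus` are illustrative, not part of our stack):

```typescript
// A minimal in-memory event bus illustrating the decoupling idea.
type Handler = (detail: unknown) => void;

class MiniEventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(eventType: string, handler: Handler): void {
    const list = this.handlers.get(eventType) ?? [];
    list.push(handler);
    this.handlers.set(eventType, list);
  }

  publish(eventType: string, detail: unknown): void {
    // Every subscriber for this event type runs; none knows about the others
    for (const handler of this.handlers.get(eventType) ?? []) {
      handler(detail);
    }
  }
}

// Two independent services react to the same registration event
const bus = new MiniEventBus();
bus.subscribe('user.registered', (d) => console.log('send welcome email to', d));
bus.subscribe('user.registered', (d) => console.log('create profile for', d));
bus.publish('user.registered', { userId: 'u123' });
```

Adding a third consumer (say, analytics) is one more `subscribe` call; the publisher never changes, which is exactly why event-driven services scale and fail independently.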
Lesson 3: The "Pay-Per-Use" Trap
Achieving 40% cost reduction required ongoing optimization:
Function-Level Cost Monitoring
// Custom CloudWatch metrics for cost tracking
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

class CostTracker {
  private cloudWatch: CloudWatchClient;

  constructor() {
    this.cloudWatch = new CloudWatchClient({ region: 'us-east-1' });
  }

  async trackFunctionCost(functionName: string, durationMs: number, memoryMB: number): Promise<void> {
    // Estimate cost from AWS Lambda pricing: ~$0.0000166667 per GB-second
    const costPerSecond = (memoryMB / 1024) * 0.0000166667;
    const estimatedCost = (durationMs / 1000) * costPerSecond;

    await this.cloudWatch.send(new PutMetricDataCommand({
      Namespace: 'ReachGig/Costs',
      MetricData: [{
        MetricName: 'EstimatedCost',
        Dimensions: [
          { Name: 'FunctionName', Value: functionName }
        ],
        Value: estimatedCost,
        Unit: 'None',
        Timestamp: new Date()
      }]
    }));
  }
}
// Memory optimization based on usage patterns.
// Peak (p99) usage is compared against the *allocated* memory, not the
// average -- p99 is always at least the average, so comparing the two
// would never trigger a downsize. Assumes FunctionMetrics carries the
// current allocation as allocatedMB.
const optimizeMemoryAllocation = (functionMetrics: FunctionMetrics): number => {
  const { allocatedMB, p99MemoryUsed, avgDuration } = functionMetrics;

  if (p99MemoryUsed < allocatedMB * 0.7) {
    // Over-provisioned: shrink toward peak usage plus 20% headroom
    return Math.max(128, Math.ceil(p99MemoryUsed * 1.2));
  } else if (avgDuration > 5000 && allocatedMB <= 512) {
    // Slow and small: more memory also means more vCPU for CPU-bound work
    return Math.min(3008, Math.ceil(allocatedMB * 1.5));
  }
  return allocatedMB;
};
Cost Optimization Results
- 40% reduction in overall infrastructure costs
- 60% reduction in database costs through on-demand billing
- 25% reduction in Lambda costs through memory optimization
- 50% reduction in storage costs through lifecycle policies
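The storage savings came from S3 lifecycle rules. The configuration below shows the shape of such a rule (this is a sketch: the prefix and day counts are illustrative, not our exact values); the same object can be handed to the SDK's PutBucketLifecycleConfigurationCommand:

```typescript
// Sketch of an S3 lifecycle configuration; prefix and day counts are
// illustrative. Objects step down to cheaper storage classes as they age.
const lifecycleConfiguration = {
  Rules: [
    {
      ID: 'tier-down-old-uploads',
      Status: 'Enabled',
      Filter: { Prefix: 'uploads/' },
      Transitions: [
        { Days: 30, StorageClass: 'STANDARD_IA' }, // infrequent access after a month
        { Days: 90, StorageClass: 'GLACIER' }      // cold archive after a quarter
      ],
      Expiration: { Days: 365 }                    // delete after a year
    }
  ]
};
```

The win is that the tiering is declarative: once the rule is attached to the bucket, S3 moves objects on schedule with no Lambda, cron, or human involved.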
Results and Key Metrics
Our serverless architecture delivered impressive results:
Reliability Metrics
- 99.8% uptime over 18 months
- Mean Time to Recovery (MTTR): 4.2 minutes
- Mean Time Between Failures (MTBF): 720 hours
- Zero data loss incidents
Performance Metrics
- API response time: P50: 45ms, P95: 150ms, P99: 400ms
- Database query time: P95: 25ms
- End-to-end request processing: P95: 200ms
- Cold start impact: <2% of total requests
Cost Metrics
- 40% cost reduction compared to traditional infrastructure
- Pay-per-use scaling: Zero waste during low-traffic periods
- Operational cost reduction: 70% less time spent on infrastructure management
Key Takeaways for Serverless Success
Based on our experience, here are the critical factors for serverless success:
1. Design for Event-Driven Architecture
- Embrace asynchronous processing
- Use events to decouple services
- Implement proper error handling and retries
2. Optimize for Cold Starts
- Use provisioned concurrency for critical functions
- Implement connection pooling and caching
- Keep function bundles small and optimized
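On keeping bundles small: a bundler configuration along these lines (a sketch, assuming esbuild; the entry point, target, and output path are hypothetical) trims the deployment package, which directly shortens cold starts:

```typescript
// Hypothetical esbuild options for a compact Lambda bundle.
// Marking the AWS SDK as external keeps it out of the bundle entirely,
// since the Lambda Node.js runtime already ships with it.
const buildOptions = {
  entryPoints: ['src/handler.ts'], // assumed entry point
  bundle: true,                    // inline app dependencies into one file
  minify: true,                    // strip whitespace and shorten names
  platform: 'node',
  target: 'node18',                // match the Lambda runtime
  external: ['@aws-sdk/*'],        // provided by the runtime, skip bundling
  outfile: 'dist/handler.js'
};
```

A single minified file also means less for the runtime to read and parse at init, which is where cold-start time actually goes.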
3. Monitor Everything
- Implement comprehensive logging and tracing
- Set up meaningful alerts based on business metrics
- Use distributed tracing to understand request flows
4. Security First
- Apply least privilege principles to IAM roles
- Use AWS Secrets Manager for sensitive data
- Implement proper input validation and rate limiting
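On the validation point, even a hand-rolled guard like this sketch (the RegistrationRequest shape and error messages are hypothetical) rejects malformed input at the edge, before it touches business logic or the database:

```typescript
// Hypothetical request shape; validate before touching business logic
interface RegistrationRequest {
  email: string;
  name: string;
}

function validateRegistration(body: unknown): RegistrationRequest {
  if (typeof body !== 'object' || body === null) {
    throw new Error('Request body must be a JSON object');
  }
  const { email, name } = body as Record<string, unknown>;
  // Loose shape check: one @, at least one dot in the domain part
  if (typeof email !== 'string' || !/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    throw new Error('A valid email address is required');
  }
  if (typeof name !== 'string' || name.trim().length === 0 || name.length > 100) {
    throw new Error('Name must be a non-empty string of at most 100 characters');
  }
  return { email, name };
}
```

Failing fast here keeps garbage out of DynamoDB and turns would-be 500s into clean 400 responses at the API handler.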
5. Cost Optimization is Ongoing
- Monitor function-level costs and usage
- Right-size memory allocations based on usage patterns
- Use appropriate storage classes and lifecycle policies
Conclusion
Serverless isn't magic. It doesn't solve bad code, and it introduces its own set of headaches (distributed tracing is... fun).
But for a small team with big ambitions, it is a cheat code. It allowed us to punch way above our weight class. We didn't have to hire a frantic on-call engineer because AWS was our on-call engineer.
If you are building a startup today, I highly recommend you skip the Kubernetes cluster. Focus on your product. Let Jeff Bezos worry about the servers.
Harshavardhan is a Founding Engineer at Mantys Healthcare AI. He still has nightmares about configuring VPC endpoints.