2024-12-01
10 min read

Serverless Architecture: The Hard Lessons

Everyone told us not to use serverless for a production startup. We did it anyway. Here is what broke, and how we fixed it.

AWS · Serverless · Architecture


When I started building ReachGig, I made a bet on serverless architecture that many people told me was stupid.

"Serverless is for toy apps," they said. "You'll hit cold starts." "Vendor lock-in will kill you."

They weren't entirely wrong. But neither were we.

Eighteen months later, our AWS serverless stack was handling thousands of users with 99.8% uptime, and our infrastructure bill was a fraction of what a traditional EC2 cluster would have cost. But getting there wasn't a straight line. We hit walls. APIs that were supposed to respond in milliseconds took 3 seconds. We had debugging sessions that made me question my career choices.

Here is the honest breakdown of how we built a production-grade serverless platform, including the stuff that sucked and how we fixed it.

Why We Took the Risk

The decision wasn't about hype. It was about survival.

  1. I was the only DevOps engineer. I didn't have time to patch servers or manage Kubernetes clusters. I needed to write feature code.
  2. We were broke. Paying for idle EC2 instances at 3 AM when nobody was using the app was not an option.
  3. Our traffic was spiky. We needed to scale from 0 to 1000 and back to 0 in minutes without me waking up to adjust an auto-scaling group.

My 'Frankenstein' Stack

We didn't just use Lambda. We used the whole AWS toy box.

Our final architecture combined multiple AWS services into a cohesive system:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   CloudFront    │────│   API Gateway    │────│   Lambda Funcs  │
│   (Global CDN)  │    │  (API Routing)   │    │  (Business Logic)│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
                       ┌──────────────────┐             │
                       │   DynamoDB       │◄────────────┘
                       │  (Primary DB)    │
                       └──────────────────┘
                                │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│      S3         │    │   EventBridge    │────│   SQS/SNS       │
│ (File Storage)  │    │  (Event Routing) │    │  (Messaging)    │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Core Components

// API Gateway + Lambda integration
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const apiHandler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  try {
    // Parse request
    const { httpMethod, path, body } = event;
    const requestData = body ? JSON.parse(body) : {};

    // Route to appropriate handler
    const result = await routeRequest(httpMethod, path, requestData);

    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*'
      },
      body: JSON.stringify(result)
    };
  } catch (error) {
    console.error('API Error:', error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: 'Internal server error' })
    };
  }
};

// DynamoDB operations
import { DynamoDBClient, PutItemCommand, GetItemCommand } from '@aws-sdk/client-dynamodb';
import { marshall, unmarshall } from '@aws-sdk/util-dynamodb';

class DynamoDBService {
  private client: DynamoDBClient;

  constructor() {
    this.client = new DynamoDBClient({ region: 'us-east-1' });
  }

  async createItem(tableName: string, item: any): Promise<any> {
    const command = new PutItemCommand({
      TableName: tableName,
      Item: marshall(item),
      ConditionExpression: 'attribute_not_exists(id)'
    });

    return await this.client.send(command);
  }

  async getItem(tableName: string, key: any): Promise<any> {
    const command = new GetItemCommand({
      TableName: tableName,
      Key: marshall(key)
    });

    const result = await this.client.send(command);
    return result.Item ? unmarshall(result.Item) : null;
  }
}

Lesson 1: Cold Starts Are Real (But Solvable)

The biggest concern about serverless was cold start latency. Here's how we addressed it:

Provisioned Concurrency for Critical Functions

// CloudFormation template for provisioned concurrency
const provisionedConcurrencyConfig = {
  Type: 'AWS::Lambda::ProvisionedConcurrencyConfig',
  Properties: {
    FunctionName: 'critical-api-function',
    // Provisioned concurrency requires a published version or alias; $LATEST is not allowed
    Qualifier: 'live',
    ProvisionedConcurrentExecutions: 10
  }
};
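One way to avoid paying for 10 warm copies around the clock is to let Application Auto Scaling adjust provisioned concurrency on a schedule. A rough sketch of the scalable-target registration (function name, alias, and capacity numbers here are illustrative, not our production values):

```typescript
// Sketch: register provisioned concurrency as a scalable target so it can
// shrink overnight and grow during business hours. Names are placeholders.
const scalableTargetInput = {
  ServiceNamespace: 'lambda',
  ResourceId: 'function:critical-api-function:live',  // alias, not $LATEST
  ScalableDimension: 'lambda:function:ProvisionedConcurrency',
  MinCapacity: 2,    // keep a couple of instances warm off-peak
  MaxCapacity: 10    // full warmth during peak hours
};

// With the AWS SDK v3 this object would be passed to
// `new RegisterScalableTargetCommand(scalableTargetInput)` from
// @aws-sdk/client-application-autoscaling, then paired with a scheduled action.
console.log(scalableTargetInput.MaxCapacity); // 10
```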

Connection Pooling and Caching

// Singleton pattern for database connections
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';
import { Agent } from 'https';

class DatabaseConnection {
  private static instance: DynamoDBClient | null = null;

  static getInstance(): DynamoDBClient {
    if (!this.instance) {
      this.instance = new DynamoDBClient({
        region: process.env.AWS_REGION,
        maxAttempts: 3,
        // Reuse connections across Lambda invocations
        requestHandler: new NodeHttpHandler({
          httpsAgent: new Agent({
            keepAlive: true,
            keepAliveMsecs: 1000,
            maxSockets: 10
          })
        })
      });
    }
    return this.instance;
  }
}

// In-memory caching for frequently accessed data
const cache = new Map<string, { value: any, expiry: number }>();

function getCachedData(key: string): any | null {
  const cached = cache.get(key);
  if (cached && Date.now() < cached.expiry) {
    return cached.value;
  }
  cache.delete(key);
  return null;
}

function setCachedData(key: string, value: any, ttlMs: number = 300000): void {
  cache.set(key, {
    value,
    expiry: Date.now() + ttlMs
  });
}
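We usually wrapped the two cache helpers above in a read-through function so call sites don't repeat the check-then-fill dance. A minimal sketch, where `fetchFn` stands in for any loader (a DynamoDB get, an HTTP call):

```typescript
// Minimal read-through cache: check memory first, fall back to the loader.
// `fetchFn` is a stand-in for any data source (e.g. a DynamoDB getItem call).
const cache = new Map<string, { value: any; expiry: number }>();

async function getOrFetch<T>(
  key: string,
  fetchFn: () => Promise<T>,
  ttlMs: number = 300_000
): Promise<T> {
  const cached = cache.get(key);
  if (cached && Date.now() < cached.expiry) {
    return cached.value as T; // cache hit: no network round trip
  }
  const value = await fetchFn();
  cache.set(key, { value, expiry: Date.now() + ttlMs });
  return value;
}

// Usage: the second call is served from memory, so the loader runs once.
let loads = 0;
const load = async () => { loads++; return { plan: 'pro' }; };
(async () => {
  await getOrFetch('user#42', load);
  await getOrFetch('user#42', load);
  console.log(loads); // 1
})();
```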

Optimization Results

  • P99 cold start latency: Reduced from 2.5s to 400ms
  • Warm response time: Consistently under 50ms
  • Cache hit rate: 85% for frequently accessed data

Lesson 2: Thinking in Events (The Async Mindset)

Serverless architecture naturally led us toward event-driven patterns:

Asynchronous Processing Pipeline

// EventBridge event publisher
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

class EventPublisher {
  private client: EventBridgeClient;

  constructor() {
    this.client = new EventBridgeClient({ region: 'us-east-1' });
  }

  async publishEvent(eventType: string, data: any): Promise<void> {
    const command = new PutEventsCommand({
      Entries: [{
        Source: 'reachgig.platform',
        DetailType: eventType,
        Detail: JSON.stringify(data),
        EventBusName: 'default'
      }]
    });

    await this.client.send(command);
  }
}

// Event handler for user registration
import { EventBridgeEvent } from 'aws-lambda';

export const handleUserRegistration = async (event: EventBridgeEvent<string, any>): Promise<void> => {
  // By the time EventBridge invokes the function, `detail` is already a parsed object
  const userData = event.detail;

  // Parallel processing using Promise.all
  await Promise.all([
    sendWelcomeEmail(userData),
    createUserProfile(userData),
    initializeUserPreferences(userData),
    trackRegistrationMetrics(userData)
  ]);
};

// SQS for reliable message processing (partial batch responses)
// Requires "ReportBatchItemFailures" enabled on the event source mapping
import { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

export const processMessageQueue = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const messageData = JSON.parse(record.body);
        await processMessage(messageData);
      } catch (error) {
        console.error('Message processing failed:', error);

        // Report only this message as failed; SQS retries it, and the
        // dead letter queue catches it after maxReceiveCount attempts
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    })
  );

  return { batchItemFailures };
};

Benefits of Event-Driven Architecture

  • Loose coupling: Services communicate through events, not direct calls
  • Scalability: Each service scales independently based on its event load
  • Resilience: Failed events can be retried automatically
  • Observability: Event flows provide clear audit trails
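That automatic retry behavior doesn't come for free; you configure it on the rule target. A sketch of the PutTargets input for the SDK's EventBridgeClient (the rule name, ARNs, and limits are placeholders):

```typescript
// Sketch of an EventBridge target with an explicit retry policy and a DLQ.
// The rule name, Lambda ARN, and queue ARN below are placeholders.
const putTargetsInput = {
  Rule: 'user-registration-rule',
  Targets: [
    {
      Id: 'registration-handler',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:handleUserRegistration',
      RetryPolicy: {
        MaximumRetryAttempts: 3,        // give transient failures a few chances
        MaximumEventAgeInSeconds: 3600  // then stop retrying events older than an hour
      },
      DeadLetterConfig: {
        // Undeliverable events land here for inspection instead of vanishing
        Arn: 'arn:aws:sqs:us-east-1:123456789012:registration-dlq'
      }
    }
  ]
};

// With the AWS SDK v3 this object would be passed to
// `new PutTargetsCommand(putTargetsInput)` on an EventBridgeClient.
console.log(putTargetsInput.Targets[0].RetryPolicy.MaximumRetryAttempts); // 3
```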

Lesson 3: The "Pay-Per-Use" Trap

Pay-per-use sounds cheap until an over-provisioned memory setting or a chatty function quietly burns money. Our 40% cost reduction wasn't a one-time win; it required ongoing optimization:

Function-Level Cost Monitoring

// Custom CloudWatch metrics for cost tracking
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

class CostTracker {
  private cloudWatch: CloudWatchClient;

  constructor() {
    this.cloudWatch = new CloudWatchClient({ region: 'us-east-1' });
  }

  async trackFunctionCost(functionName: string, duration: number, memoryMB: number): Promise<void> {
    // Lambda bills per GB-second, so convert memory (MB) to GB and duration (ms) to seconds
    const costPerSecond = (memoryMB / 1024) * 0.0000166667; // ≈ $0.0000166667 per GB-second
    const estimatedCost = (duration / 1000) * costPerSecond;

    await this.cloudWatch.send(new PutMetricDataCommand({
      Namespace: 'ReachGig/Costs',
      MetricData: [{
        MetricName: 'EstimatedCost',
        Dimensions: [
          { Name: 'FunctionName', Value: functionName }
        ],
        Value: estimatedCost,
        Unit: 'None',
        Timestamp: new Date()
      }]
    }));
  }
}

// Memory right-sizing based on usage patterns
// (p99 usage can never sit below average usage, so the meaningful
// comparison is usage against the function's configured allocation)
const optimizeMemoryAllocation = (functionMetrics: FunctionMetrics): number => {
  const { allocatedMemory, p99MemoryUsed, avgDuration } = functionMetrics;

  if (p99MemoryUsed < allocatedMemory * 0.7) {
    // Function uses well under its allocation: shrink it, keeping 20% headroom
    return Math.max(128, Math.ceil(p99MemoryUsed * 1.2));
  } else if (avgDuration > 5000 && allocatedMemory > 512) {
    // Slow and already sizeable: on Lambda, more memory also buys more CPU
    return Math.min(3008, allocatedMemory * 1.5);
  }

  return allocatedMemory;
};
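To sanity-check the heuristic, here's a standalone version with a hypothetical `FunctionMetrics` shape inlined; it compares p99 usage against the function's configured allocation:

```typescript
// Standalone sketch of the memory heuristic, with a hypothetical
// FunctionMetrics shape, so the numbers can be checked in isolation.
interface FunctionMetrics {
  allocatedMemory: number; // MB configured on the function
  p99MemoryUsed: number;   // MB actually used, p99
  avgDuration: number;     // ms
}

const optimizeMemoryAllocation = (m: FunctionMetrics): number => {
  if (m.p99MemoryUsed < m.allocatedMemory * 0.7) {
    return Math.max(128, Math.ceil(m.p99MemoryUsed * 1.2)); // shrink with 20% headroom
  } else if (m.avgDuration > 5000 && m.allocatedMemory > 512) {
    return Math.min(3008, m.allocatedMemory * 1.5); // more memory = more CPU on Lambda
  }
  return m.allocatedMemory;
};

// A generously allocated but under-used function gets shrunk:
console.log(optimizeMemoryAllocation({ allocatedMemory: 1024, p99MemoryUsed: 512, avgDuration: 800 })); // 615

// A slow function already near its allocation gets more memory (and CPU):
console.log(optimizeMemoryAllocation({ allocatedMemory: 1024, p99MemoryUsed: 1000, avgDuration: 8000 })); // 1536
```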

Cost Optimization Results

  • 40% reduction in overall infrastructure costs
  • 60% reduction in database costs through on-demand billing
  • 25% reduction in Lambda costs through memory optimization
  • 50% reduction in storage costs through lifecycle policies
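The lifecycle savings came from rules along these lines. A sketch of a PutBucketLifecycleConfiguration input (bucket name, prefix, and day thresholds are illustrative):

```typescript
// Sketch of the kind of S3 lifecycle rule behind the storage savings.
// Bucket, prefix, and day thresholds are placeholders; tune to access patterns.
const lifecycleInput = {
  Bucket: 'example-uploads-bucket',
  LifecycleConfiguration: {
    Rules: [
      {
        ID: 'tier-old-uploads',
        Status: 'Enabled',
        Filter: { Prefix: 'uploads/' },
        Transitions: [
          { Days: 30, StorageClass: 'STANDARD_IA' },  // rarely-read files get cheaper
          { Days: 90, StorageClass: 'GLACIER' }       // cold archives get cheapest
        ],
        Expiration: { Days: 365 }  // delete after a year
      }
    ]
  }
};

// With the AWS SDK v3 this would be passed to
// `new PutBucketLifecycleConfigurationCommand(lifecycleInput)`.
console.log(lifecycleInput.LifecycleConfiguration.Rules[0].Transitions.length); // 2
```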

Results and Key Metrics

Our serverless architecture delivered impressive results:

Reliability Metrics

  • 99.8% uptime over 18 months
  • Mean Time to Recovery (MTTR): 4.2 minutes
  • Mean Time Between Failures (MTBF): 720 hours
  • Zero data loss incidents

Performance Metrics

  • API response time: P50: 45ms, P95: 150ms, P99: 400ms
  • Database query time: P95: 25ms
  • End-to-end request processing: P95: 200ms
  • Cold start impact: <2% of total requests

Cost Metrics

  • 40% cost reduction compared to traditional infrastructure
  • Pay-per-use scaling: Zero waste during low-traffic periods
  • Operational cost reduction: 70% less time spent on infrastructure management

Key Takeaways for Serverless Success

Based on our experience, here are the critical factors for serverless success:

1. Design for Event-Driven Architecture

  • Embrace asynchronous processing
  • Use events to decouple services
  • Implement proper error handling and retries

2. Optimize for Cold Starts

  • Use provisioned concurrency for critical functions
  • Implement connection pooling and caching
  • Keep function bundles small and optimized

3. Monitor Everything

  • Implement comprehensive logging and tracing
  • Set up meaningful alerts based on business metrics
  • Use distributed tracing to understand request flows

4. Security First

  • Apply least privilege principles to IAM roles
  • Use AWS Secrets Manager for sensitive data
  • Implement proper input validation and rate limiting
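Concretely, "least privilege" means policies scoped like this: one function, one table, only the actions it needs (the account ID and table ARN below are placeholders):

```typescript
// Example least-privilege policy for a single Lambda function:
// it can read and write exactly one DynamoDB table and nothing else.
// The table ARN is a placeholder.
const taskPolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Effect: 'Allow',
      Action: ['dynamodb:GetItem', 'dynamodb:PutItem', 'dynamodb:Query'],
      Resource: 'arn:aws:dynamodb:us-east-1:123456789012:table/Users'
    }
  ]
};

// No `dynamodb:*`, no `Resource: '*'`: if the function is compromised,
// the blast radius is one table.
console.log(taskPolicy.Statement[0].Action.length); // 3
```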

5. Cost Optimization is Ongoing

  • Monitor function-level costs and usage
  • Right-size memory allocations based on usage patterns
  • Use appropriate storage classes and lifecycle policies

Conclusion

Serverless isn't magic. It doesn't solve bad code, and it introduces its own set of headaches (distributed tracing is... fun).

But for a small team with big ambitions, it is a cheat code. It allowed us to punch way above our weight class. We didn't have to hire a frantic on-call engineer because AWS was our on-call engineer.

If you are building a startup today, I highly recommend you skip the Kubernetes cluster. Focus on your product. Let Jeff Bezos worry about the servers.


Harshavardhan is a Founding Engineer at Mantys Healthcare AI. He still has nightmares about configuring VPC endpoints.