AWS Cost Optimisation

How to Build a $500/Month Monitoring Stack That Actually Works

Stop paying $2,000+/month for enterprise monitoring tools. This guide shows startups how to build a complete monitoring stack with CloudWatch, CloudWatch Logs Insights, AWS X-Ray, and open-source tools for under $500/month.

Cloud Associates

We’ve seen too many startups paying $3,000+/month for enterprise monitoring tools like Datadog or New Relic. The pattern is always the same: applications serving millions of requests with a handful of microservices, teams frustrated with costs but feeling locked in because “monitoring is critical.”

Here’s what we’ve learned: you can build a production-grade monitoring stack using AWS-native tools and selective open-source solutions for under $500/month. Same observability. Better retention. Faster queries. Full control over your data.

The Problem with Enterprise Monitoring Tools

Don’t get me wrong: Datadog, New Relic, and Dynatrace are excellent products. But their pricing models are optimised for enterprises, not startups.

Typical enterprise monitoring costs for a startup:

  • Datadog: $1,500-5,000/month ($15/host/month + $0.10/GB logs + custom metrics)
  • New Relic: $1,200-4,000/month (similar pricing model)
  • Dynatrace: $2,000-8,000/month (even more expensive)

Why so expensive?

  • Per-host pricing (every EC2 instance, container, Lambda function costs money)
  • Per-GB log ingestion fees
  • Custom metrics charges
  • APM (Application Performance Monitoring) add-ons
  • Premium features locked behind higher tiers

The trap: Start on generous free tiers, scale your infrastructure, suddenly you’re paying $3K+/month.

The $500/Month Monitoring Stack

Here’s what we built and what it costs:

Core Components

1. CloudWatch Metrics + Alarms

  • Cost: $50-80/month
  • What it does: Infrastructure metrics, application metrics, alerting

2. CloudWatch Logs + Logs Insights

  • Cost: $150-250/month
  • What it does: Centralized logging, log analysis, structured queries

3. AWS X-Ray

  • Cost: $20-40/month
  • What it does: Distributed tracing, request flow visualization, latency analysis

4. Uptime Monitoring (Better Uptime)

  • Cost: $20/month
  • What it does: External uptime checks, status page, incident management

5. Error Tracking (Sentry OSS Self-Hosted)

  • Cost: $50/month (hosting on t3.medium EC2)
  • What it does: Error tracking, release tracking, user impact analysis

6. CloudWatch Dashboard

  • Cost: Free (first 3 dashboards are free, then $3/dashboard/month)
  • What it does: Centralized metrics visualization

7. AWS Budgets + Cost Alerts

  • Cost: Free (first 2 budgets free, $0.02/day per additional)
  • What it does: Cost monitoring and alerting

Total: ~$300-450/month, depending mostly on log volume

Let’s break down each component and why it was chosen.

Component #1: CloudWatch Metrics + Alarms ($50-80/month)

What it does:

  • Collects metrics from EC2, ECS, RDS, ALB, Lambda, and custom application metrics
  • Creates alarms based on metric thresholds
  • Sends notifications via SNS (email, Slack, PagerDuty)

Setup:

1. Enable detailed monitoring for EC2/ECS

# Enable detailed monitoring (1-minute intervals)
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0

2. Create custom metrics from your application

// Node.js example with aws-sdk
const { CloudWatch } = require('@aws-sdk/client-cloudwatch');
const cw = new CloudWatch({ region: 'ap-southeast-2' });

async function publishMetric(metricName, value) {
  await cw.putMetricData({
    Namespace: 'MyApp/API',
    MetricData: [
      {
        MetricName: metricName,
        Value: value,
        Unit: 'Count',
        Timestamp: new Date(),
      },
    ],
  });
}

// Track custom business metrics
await publishMetric('UserSignups', 1);
await publishMetric('OrdersCompleted', 1);
await publishMetric('RevenueUSD', 49.99);

3. Create alarms for critical metrics

# High 5xx error count alarm on the ALB
aws cloudwatch put-metric-alarm \
  --alarm-name high-api-error-rate \
  --alarm-description "Alert when 5xx errors exceed 5 in a 5-minute window" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-southeast-2:123456789:alerts
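
The `--alarm-actions` flag above points at an SNS topic; if you don't have one yet, a minimal setup might look like this (the topic name and email address are placeholders):

```shell
# Create an SNS topic for alerts and subscribe an email address
aws sns create-topic --name alerts

aws sns subscribe \
  --topic-arn arn:aws:sns:ap-southeast-2:123456789:alerts \
  --protocol email \
  --notification-endpoint alerts@example.com
```

Slack and PagerDuty can subscribe to the same topic via their AWS integrations, so one topic fans out to every channel.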

Key alarms to set up:

  • High error rate (4xx, 5xx) on ALB
  • High CPU utilization on EC2/ECS (> 80%)
  • High memory utilization (> 85%)
  • RDS connections near limit
  • Lambda errors and throttles
  • API latency (p95 > threshold)
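
The p95 latency alarm in that list needs an extended statistic rather than the basic ones. A sketch with the AWS CLI (the load balancer dimension and 1-second threshold are placeholders to tune per service):

```shell
# p95 latency alarm on the ALB (TargetResponseTime is reported in seconds)
aws cloudwatch put-metric-alarm \
  --alarm-name high-api-latency-p95 \
  --alarm-description "Alert when p95 response time > 1s" \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --extended-statistic p95 \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-southeast-2:123456789:alerts
```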

Cost breakdown:

  • CloudWatch Metrics: $0.30/metric/month for the first 10,000 metrics (the free tier covers 10 custom metrics)
  • CloudWatch Alarms: $0.10/alarm/month (first 10 free)
  • Typical usage: 50 custom metrics, 20 alarms = ~$16/month
  • API requests: ~$5/month
  • Total: $20-30/month for the basics; detailed monitoring and a larger metric count push this toward the $50-80 headline figure

Component #2: CloudWatch Logs + Logs Insights ($150-250/month)

What it does:

  • Centralised log collection from EC2, ECS, Lambda, RDS
  • Structured log queries with CloudWatch Logs Insights
  • Log retention management (reduce costs by aging out old logs)

Setup:

1. Configure application logging

// Use structured JSON logging (critical for Logs Insights queries)
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: { service: 'api-service' },
  transports: [
    new winston.transports.Console(),
  ],
});

// Log with structured data
logger.info('User login', {
  userId: '12345',
  email: 'user@example.com',
  ipAddress: '203.0.113.42',
  duration: 340,
});

2. Send logs to CloudWatch from ECS/Fargate

{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/api-service",
      "awslogs-region": "ap-southeast-2",
      "awslogs-stream-prefix": "ecs"
    }
  }
}

3. Query logs with Logs Insights

# Find all errors in the last hour
fields @timestamp, @message, userId, errorMessage
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Average and p95 latency per API endpoint
fields @timestamp, endpoint, duration
| filter endpoint like /^\/api/
| stats avg(duration), pct(duration, 95) by endpoint

# Find slow database queries (duration in ms)
fields @timestamp, query, duration
| filter duration > 1000
| sort duration desc
| limit 20

4. Set up log retention policies

# Keep recent logs longer, age out old logs to save money
aws logs put-retention-policy \
  --log-group-name /ecs/api-service \
  --retention-in-days 30

aws logs put-retention-policy \
  --log-group-name /aws/lambda/background-worker \
  --retention-in-days 7

Cost breakdown:

  • Log ingestion: $0.50/GB
  • Log storage: $0.03/GB/month
  • Logs Insights queries: $0.005/GB scanned

Typical startup (10 GB logs/month, 30-day retention):

  • Ingestion: 10 GB × $0.50 = $5/month
  • Storage: ~10 GB retained at any time × $0.03/GB-month = ~$0.30/month
  • Queries: 100 GB scanned × $0.005 = $0.50/month
  • Total: ~$6/month

As you scale (100 GB logs/month):

  • Ingestion: $50/month
  • Storage: ~$3/month (with 30-day retention)
  • Queries: $5/month
  • Total: ~$60/month (the $150-250 headline figure assumes several hundred GB of logs per month)

Cost optimisation tips:

  1. Filter before sending - Don’t log debug messages in production
  2. Use shorter retention - 7 days for most logs, 30 days for critical logs
  3. Sample high-volume logs - Log 1% of successful requests, 100% of errors
  4. Archive to S3 - Export old logs to S3 ($0.023/GB storage) for long-term retention
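
For the S3 archive tip, CloudWatch Logs has a built-in export mechanism. A sketch (the bucket name and epoch-millisecond time range are placeholders, and the bucket needs a policy allowing delivery from CloudWatch Logs):

```shell
# Export a month of logs to S3 for cheap long-term retention
aws logs create-export-task \
  --log-group-name /ecs/api-service \
  --from 1696118400000 \
  --to 1698710400000 \
  --destination my-log-archive-bucket \
  --destination-prefix api-service
```

Run this on a schedule (e.g. a monthly EventBridge rule), then apply the short retention policies above so CloudWatch only keeps the hot data.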

Component #3: AWS X-Ray ($20-40/month)

What it does:

  • Distributed tracing (track requests across microservices)
  • Service maps (visualise how services call each other)
  • Latency analysis (find slow services/dependencies)
  • Error tracking across services

Setup:

1. Install X-Ray daemon on EC2/ECS

# Dockerfile with the X-Ray daemon bundled (running the daemon as a
# sidecar container is the recommended pattern; one image keeps this simple)
FROM node:18-alpine

# Install X-Ray daemon
RUN apk add --no-cache curl
RUN curl -o /usr/local/bin/xray https://s3.amazonaws.com/aws-xray-assets.us-east-1/xray-daemon/aws-xray-daemon-linux-3.x
RUN chmod +x /usr/local/bin/xray

# Start the X-Ray daemon in the background, then the application
# (shell form so & works; the JSON exec form cannot chain commands)
CMD /usr/local/bin/xray -o & exec node server.js

2. Instrument your application

// Node.js Express app with X-Ray (the express middleware lives in the
// full aws-xray-sdk package, not aws-xray-sdk-core)
const AWSXRay = require('aws-xray-sdk');
// Wrap the AWS SDK so outgoing AWS calls appear as subsegments
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const express = require('express');
const express = require('express');

const app = express();

// Enable X-Ray middleware
app.use(AWSXRay.express.openSegment('api-service'));

app.get('/api/users/:id', async (req, res) => {
  // X-Ray automatically traces this HTTP request
  const subsegment = AWSXRay.getSegment().addNewSubsegment('fetch-user');

  try {
    const user = await fetchUserFromDB(req.params.id);
    subsegment.close();
    res.json(user);
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    res.status(500).json({ error: 'Internal server error' });
  }
});

// Close X-Ray segment
app.use(AWSXRay.express.closeSegment());

app.listen(3000);

3. Analyse traces in X-Ray console

  • Service map shows request flow: ALB → API → Database → Cache
  • Trace view shows exact timing: 120ms total (50ms DB, 20ms cache, 50ms processing)
  • Filter slow traces: Show all requests > 1 second
  • Error analysis: Which service is throwing errors?

Cost breakdown:

  • Traces recorded: $5/million traces
  • Traces retrieved or scanned: $0.50/million traces
  • Free tier: first 100,000 traces recorded per month are free

Typical startup (2M requests/month, 10% sampling):

  • Traces recorded: 200,000 × $5/million = $1/month
  • Traces retrieved: ~10,000 × $0.50/million = $0.01/month
  • Total: ~$1-2/month

With 100% sampling (expensive):

  • 2M requests × $5/million = $10/month

Recommendation: Use 5-10% sampling for normal traffic, 100% for errors.
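
With the Node SDK, the default sampling rate can be set from a local rule file loaded via `AWSXRay.middleware.setSamplingRules(...)`. This is an assumed minimal config; note that sampling is decided when a request starts, so "100% for errors" can't be expressed here and is better covered by error tracking (Sentry, below):

```json
{
  "version": 2,
  "rules": [],
  "default": {
    "fixed_target": 1,
    "rate": 0.05
  }
}
```

`fixed_target: 1` records at least one trace per second per instance regardless of rate, which keeps low-traffic endpoints visible.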

Component #4: Uptime Monitoring ($20/month)

What it does:

  • External HTTP checks every 30-60 seconds
  • Alerts when your site is down (before customers notice)
  • Status page for customers
  • Incident management

Tool: Better Uptime

Why not CloudWatch for this?

  • CloudWatch monitors from inside AWS
  • If AWS region fails, your CloudWatch alarms fail too
  • Need external monitoring to catch AWS outages

Setup:

  1. Create HTTP checks for critical endpoints (e.g. the homepage and an API health check).

  2. Set up alert channels:

    • Email (immediate)
    • Slack (immediate)
    • PagerDuty (for on-call rotation)
  3. Create public status page:

    • status.example.com
    • Shows uptime, incident history
    • Customers can subscribe to updates

Cost: $20/month (Better Uptime team plan, 30 checks)

Alternatives:

  • UptimeRobot: $7/month (50 checks)
  • Pingdom: $15/month (10 checks)
  • StatusCake: Free (unlimited checks, 5-minute intervals)

Component #5: Error Tracking - Sentry OSS ($50/month self-hosted)

What it does:

  • Captures unhandled exceptions and errors
  • Groups similar errors together
  • Shows user impact (how many users affected)
  • Release tracking (which deploy introduced the bug)
  • Source map support (see original TypeScript/JSX source)

Why self-host instead of Sentry SaaS?

  • Sentry SaaS: $26/month (5K errors) → $80/month (50K errors)
  • Self-hosted: $50/month EC2 costs, unlimited errors

Setup (self-hosted on EC2):

# Launch t3.medium EC2 instance (2 vCPU, 4 GB RAM)
# Install Docker and Docker Compose

# Clone Sentry self-hosted repo
git clone https://github.com/getsentry/self-hosted.git
cd self-hosted

# Run install script
./install.sh

# Start Sentry
docker-compose up -d

Instrument your application:

// Node.js example
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: 'https://your-dsn@sentry.example.com/1',
  environment: 'production',
  release: process.env.GIT_COMMIT,
});

// Capture exceptions
try {
  riskyOperation();
} catch (error) {
  Sentry.captureException(error);
  throw error;
}

// Add context to errors
Sentry.setUser({ id: user.id, email: user.email });
Sentry.setContext('order', { orderId: order.id, total: order.total });

Cost:

  • EC2 t3.medium: $30/month (reserved instance)
  • EBS storage (100 GB): $10/month
  • Data transfer: ~$5/month
  • Total: ~$50/month

Alternative (if you don’t want to self-host):

  • Sentry SaaS: $26-80/month
  • Rollbar: $25/month
  • Bugsnag: $59/month

Component #6: CloudWatch Dashboards ($9/month)

What it does:

  • Single pane of glass for all metrics
  • Customizable charts and widgets
  • Automatic refresh

Setup:

Create 3 dashboards:

1. Application Health Dashboard

  • API request rate (per minute)
  • API latency (p50, p95, p99)
  • Error rate (4xx, 5xx)
  • Active users
  • Database connections

2. Infrastructure Dashboard

  • EC2 CPU utilization
  • Memory utilization
  • Disk I/O
  • Network in/out
  • ECS task count

3. Business Metrics Dashboard

  • User signups (per hour)
  • Orders completed
  • Revenue (daily)
  • Active sessions
  • Conversion funnel

Cost: Free — the first 3 dashboards (up to 50 metrics each) are free, then $3/dashboard/month.

Tip: Keep it to 3 dashboards to stay within the free tier.
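
Dashboards can also be created from the CLI, which lets you keep them in version control. A minimal sketch (dashboard name, region, and load balancer dimension are placeholders):

```shell
# One widget showing ALB 5xx counts; extend the widgets array for more charts
aws cloudwatch put-dashboard \
  --dashboard-name app-health \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "API 5xx errors",
        "region": "ap-southeast-2",
        "period": 300,
        "metrics": [["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
                     "LoadBalancer", "app/my-alb/1234567890abcdef",
                     { "stat": "Sum" }]]
      }
    }]
  }'
```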

Component #7: AWS Budgets + Cost Alerts (Free)

What it does:

  • Tracks AWS spending
  • Alerts when costs exceed thresholds
  • Forecasts end-of-month costs

Setup:

aws budgets create-budget \
  --account-id 123456789 \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json

budget.json:

{
  "BudgetName": "Monthly AWS Costs",
  "BudgetLimit": {
    "Amount": "500",
    "Unit": "USD"
  },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}

notifications.json:

[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [
      {
        "SubscriptionType": "EMAIL",
        "Address": "alerts@example.com"
      }
    ]
  },
  {
    "Notification": {
      "NotificationType": "FORECASTED",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 100
    },
    "Subscribers": [
      {
        "SubscriptionType": "EMAIL",
        "Address": "alerts@example.com"
      }
    ]
  }
]

Cost: Free (first 2 budgets free, $0.02/day per additional budget)
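
Beyond the account-wide budget, a budget can be scoped to a single service so the monitoring stack's own spend stays visible. An assumed variant of budget.json (the filter value follows Cost Explorer's service naming):

```json
{
  "BudgetName": "CloudWatch spend",
  "BudgetLimit": {
    "Amount": "300",
    "Unit": "USD"
  },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST",
  "CostFilters": {
    "Service": ["Amazon CloudWatch"]
  }
}
```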

Putting It All Together: The Monitoring Workflow

When an Issue Occurs

1. External monitor (Better Uptime) detects downtime

  • Sends immediate Slack alert: “API is down”
  • On-call engineer gets paged

2. Engineer opens CloudWatch Dashboard

  • Sees spike in 5xx errors
  • Sees spike in API latency
  • Database CPU at 90%

3. Engineer checks CloudWatch Logs Insights

fields @timestamp, @message, endpoint, duration
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

  • Finds slow database queries timing out
  • Identifies which endpoint is affected

4. Engineer checks X-Ray traces

  • Service map shows API → Database taking 5+ seconds
  • Trace timeline shows specific SQL query is slow
  • Identifies the problematic query

5. Engineer checks Sentry

  • 500 users affected by timeout errors
  • Error first appeared 10 minutes ago
  • Linked to latest deployment

6. Engineer mitigates

  • Scales up RDS instance (temporary fix)
  • Optimizes slow query (permanent fix)
  • Deploys fix

7. Incident resolution

  • Updates status page
  • Posts post-mortem in Slack
  • Creates ticket to prevent recurrence

Total time to diagnosis: 5 minutes (vs. 20-30 minutes with fragmented tooling)

Cost Optimisation Tips

Tip #1: Use Metric Filters Instead of Custom Metrics

Expensive way:

// Publishing a custom metric from the application (aws-sdk v3 client)
await cw.putMetricData({
  Namespace: 'MyApp',
  MetricData: [{ MetricName: 'APIErrors', Value: 1, Unit: 'Count' }],
});
// Cost: $0.30/metric/month, plus PutMetricData API request charges

Cheaper way:

# Create a metric filter from logs you're already ingesting
# (JSON filter pattern matches the structured winston logs above)
aws logs put-metric-filter \
  --log-group-name /ecs/api-service \
  --filter-name api-errors \
  --filter-pattern '{ $.level = "error" }' \
  --metric-transformations \
  metricName=APIErrors,metricNamespace=MyApp,metricValue=1
# The generated metric still bills as a custom metric, but you skip
# PutMetricData API charges and application-side instrumentation

Savings: ~$10-20/month

Tip #2: Sample High-Volume Logs

Don’t log every successful API request at scale.

// Sample 1% of successful requests, log 100% of errors.
// Log on 'finish' so the final status code and duration are available.
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const shouldLog = res.statusCode >= 400 || Math.random() < 0.01;

    if (shouldLog) {
      logger.info('API request', {
        method: req.method,
        path: req.path,
        status: res.statusCode,
        duration: Date.now() - start,
      });
    }
  });

  next();
});

Savings: 90% reduction in log volume = ~$100-200/month at scale

Tip #3: Use Shorter Log Retention

Default: 30 days (costs add up for high-volume logs)

Optimised:

  • Application logs: 7 days
  • Error logs: 30 days
  • Audit logs: 90 days (compliance requirement)
  • Archive old logs to S3: $0.023/GB vs. $0.03/GB in CloudWatch

Savings: ~$50-100/month

Tip #4: Self-Host Where It Makes Sense

Sentry SaaS: $80/month for 50K errors. Self-hosted Sentry: $50/month in EC2 costs, unlimited errors.

When to self-host:

  • High volume (more than free tiers)
  • Predictable costs matter
  • You have DevOps resources

When NOT to self-host:

  • Small volume (free tiers cover you)
  • Don’t want operational overhead
  • Value SaaS features (integrations, support)

What This Stack Can’t Do (vs. Enterprise Tools)

Limitations vs. Datadog/New Relic:

1. No unified UI

  • Datadog has one dashboard for everything
  • This stack requires switching between AWS console, Sentry, Better Uptime

2. Less sophisticated APM

  • X-Ray is good but not as powerful as Datadog APM or New Relic
  • Missing flame graphs, advanced profiling
  • Workaround: Add open-source profiling (pyroscope, pprof) if needed

3. No machine learning anomaly detection

  • Datadog uses ML to detect anomalies automatically
  • This stack requires manual threshold-based alarms
  • Workaround: Use CloudWatch Anomaly Detection (extra cost)

4. Steeper learning curve

  • Datadog is designed for ease of use
  • This stack requires AWS knowledge and some DIY integration

5. Limited multi-cloud support

  • This stack is AWS-native
  • Datadog works across AWS, GCP, Azure seamlessly

When you should upgrade to enterprise monitoring:

  • Your AWS bill is > $20K/month (monitoring cost becomes less significant)
  • You need unified cross-cloud monitoring (AWS + GCP + Azure)
  • Your team wants ML-powered anomaly detection
  • You value a single pane of glass over cost savings

Conclusion

Total monthly cost: ~$450

What you get:

  • Infrastructure monitoring (CloudWatch Metrics + Alarms)
  • Centralized logging with powerful queries (CloudWatch Logs Insights)
  • Distributed tracing (AWS X-Ray)
  • Error tracking (Sentry self-hosted)
  • Uptime monitoring (Better Uptime)
  • Cost monitoring (AWS Budgets)

vs. Datadog at ~$3,200/month for the same workload

Savings: $2,750/month = $33,000/year

Time to set up: 1-2 days for experienced engineer, 3-5 days for beginner

Ongoing maintenance: 2-4 hours/month (mostly Sentry updates and dashboard tweaks)

ROI: Pays for itself in the first month if you’re currently on expensive enterprise monitoring.

This stack isn’t perfect. It requires more setup and AWS knowledge than buying Datadog. But for cost-conscious startups, it’s a battle-tested solution that provides 90% of the observability at 15% of the cost.

Need help setting up a cost-effective monitoring stack for your AWS infrastructure? Our DevOps Automation Services include complete monitoring setup with CloudWatch, X-Ray, logging, alerting, and dashboard configuration delivered in 6 weeks for $8,500.