RenovoSolutions/cdk-library-cloudwatch-alarms

cdk-library-cloudwatch-alarms

WIP: a library that provides constructs, aspects, and construct extensions to make it easier to set up alarms for AWS resources in CDK code, based on the AWS recommended alarms list. This project is still in early development, so YMMV.

Usage

This library is flexible in its approach, and there are multiple paths to configuring alarms depending on how you'd like to work with the recommended alarms.

Feature Availability

Intended feature list as of Aug 2024

  • Aspects to apply recommended alarms to a wide scope such as a whole CDK app
    • Ability to exclude specific alarms
    • Ability to define a default set of alarm actions
    • Ability to modify the configuration of each alarm type
    • Ability to exclude specific resources
  • Constructs to ease alarm configuration for individual resources at a granular scope
    • Constructs for each available alarm according to the coverage table
    • Constructs for applying all recommended alarms to a specific resource
    • Ability to exclude specific alarms from the all recommended alarms construct
  • Extended versions of resource constructs with alarm helper methods

Resource Coverage

If a service or alarm is not listed, it has not been worked on yet.

Each service below lists its recommended alarms, marked [x] (implemented) or [ ] (not implemented), followed by notes.
S3
- [x] 4xxErrors
- [x] 5xxErrors
- [ ] OperationsFailedReplication

Replication errors are difficult to set up in CDK at the moment because replication rule properties are IResolvables and replication rules are not available on the L2 Bucket construct.
SQS
- [x] ApproximateAgeOfOldestMessage
- [x] ApproximateNumberOfMessagesNotVisible
- [x] ApproximateNumberOfMessagesVisible
- [x] NumberOfMessagesSent

- All alarms except NumberOfMessagesSent require a user-defined threshold, because the appropriate value is very use-case specific.
- The Aspect assigns DLQs only the ApproximateNumberOfMessagesVisible alarm, with a default threshold of 0, unless dlqsGetFullRecommendedAlarms is true, in which case they get the same alarms as other queues. A DLQ whose main queue is outside the scope the Aspect is added to is not detected as a DLQ and is treated as a normal queue.
SNS
- [x] NumberOfMessagesPublished
- [x] NumberOfNotificationsDelivered
- [x] NumberOfNotificationsFailed
- [x] NumberOfNotificationsFilteredOut-InvalidAttributes
- [x] NumberOfNotificationsFilteredOut-InvalidMessageBody
- [x] NumberOfNotificationsRedrivenToDlq
- [x] NumberOfNotificationsFailedToRedriveToDlq
- [ ] SMSMonthToDateSpentUSD
- [ ] SMSSuccessRate

Some alarms require a threshold to be defined. SMS alarms are not implemented.
Lambda
- [ ] ClaimedAccountConcurrency
- [x] Errors
- [x] Throttles
- [x] Duration
- [x] ConcurrentExecutions

ClaimedAccountConcurrency is account-wide and only needs to be set up once per account, so it is not covered by this library at this time.
RDS
For database and cluster instances:
- [x] CPUUtilization
- [x] DatabaseConnections
- [x] FreeableMemory
- [x] FreeLocalStorage
- [x] FreeStorageSpace
- [x] ReadLatency
- [x] WriteLatency
- [x] DBLoad

For clusters:
- [x] AuroraVolumeBytesLeftTotal
- [x] AuroraBinlogReplicaLag

Some alarms require a threshold to be defined. The AuroraVolumeBytesLeftTotal and AuroraBinlogReplicaLag alarms are created only for Aurora MySQL clusters.
ECS
- [x] CPUUtilization
- [x] MemoryUtilization
- [x] EphemeralStorageUtilized
- [x] RunningTaskCount

The alarms are applied to FargateService constructs only. EphemeralStorageUtilized requires a threshold to be defined.
EFS
- [x] PercentIOLimit
- [x] BurstCreditBalance

The alarms are applied to FileSystem constructs.
ApiGateway
- [x] 4XXError
- [x] 5XXError
- [x] Count
- [x] Latency

The alarms are applied to RestApi constructs only. Count requires a threshold to be defined. Alarms are created automatically using the ApiName and Stage dimensions. To create Count or Latency alarms using the Resource and Method dimensions, the corresponding properties must be specified explicitly.
CloudFront
- [x] 5xxErrorRate
- [x] OriginLatency
- [x] FunctionValidationErrors
- [x] FunctionExecutionErrors
- [x] FunctionThrottles

The alarms are applied to Distribution constructs only. Both 5xxErrorRate and OriginLatency require a threshold to be defined. To create function-level alarms using the FunctionName dimension, the corresponding properties must be specified explicitly.
DynamoDB
Mandatory alarms:
- [x] ReadThrottleEvents
- [x] SystemErrors
- [x] WriteThrottleEvents

Replication alarms (optional):
- [x] AgeOfOldestUnreplicatedRecord
- [x] FailedToReplicateRecordCount
- [x] ThrottledPutRecordCount

The alarms are applied to Table constructs only. All the mandatory alarms require a threshold to be defined.
Replication alarms are created only if the corresponding configuration is specified. Each replication alarm has a default DelegatedOperation dimension value:
- AgeOfOldestUnreplicatedRecord: StreamRecords
- FailedToReplicateRecordCount: StreamRecords
- ThrottledPutRecordCount: PutItem
EC2
- [x] CPUUtilization
- [x] StatusCheckFailed

The alarms are applied to Instance constructs.
AutoScaling
- [x] GroupInServiceCapacity

The alarms are applied to AutoScalingGroup constructs. The alarm requires a threshold to be defined and the AutoScalingGroup should have this metric explicitly enabled.
ElastiCache
- [x] DatabaseMemoryUsagePercentage
- [x] EngineCPUUtilization
- [x] ReplicationLag

The alarms are applied to CfnCacheCluster and CfnReplicationGroup constructs. DatabaseMemoryUsagePercentage and ReplicationLag require a threshold to be defined.
PrivateLink
For Endpoints:
- [x] PacketsDropped

For Endpoint Services:
- [x] RstPacketsSent

The alarms are applied to InterfaceVpcEndpoint and VpcEndpointService constructs. Because these constructs do not expose the attributes the alarms require, these alarms cannot be applied using an Aspect. In all cases, the threshold must be defined.
VPN
- [x] TunnelState

The alarms are applied to CfnVPNConnection constructs.
ELBv2
For ApplicationLoadBalancer:
- [x] RejectedConnectionCount
- [x] HTTPCode_ELB_4XX_Count
- [x] HTTPCode_ELB_5XX_Count
- [x] HTTPCode_Target_5XX_Count

For ApplicationTargetGroup:
- [x] HealthyHostCount
- [x] UnHealthyHostCount

For NetworkLoadBalancer:
- [x] TCP_ELB_Reset_Count
- [x] TCP_Target_Reset_Count

For NetworkTargetGroup:
- [x] HealthyHostCount
- [x] UnHealthyHostCount

- For target groups, the HealthyHostCount alarm triggers when the count falls below the threshold (default: 1) and the UnHealthyHostCount alarm triggers when the count exceeds the threshold (default: 0). For load balancers, all alarms trigger when the count exceeds the threshold (default: 0).
- The HTTPCode_ELB_4XX_Count and HTTPCode_ELB_5XX_Count alarms are defined as anomaly detection alarms instead of flat counts, because there is normally a constant background of such errors.
DMS
For ReplicationInstances:
- [x] CPUUtilization
- [x] FreeableMemory
- [x] FreeStorageSpace
- [x] WriteIOPS

For Replication Tasks:
- [x] CDCThroughputRowsSource
- [x] CDCThroughputRowsTarget
- [x] CDCLatencySource
- [x] CDCLatencyTarget
- [x] FullLoadThroughputRowsSource
- [x] FullLoadThroughputRowsTarget

The alarms are applied to CfnReplicationInstance and CfnReplicationTask constructs.

Replication Instance Notes:
- FreeableMemory and FreeStorageSpace alarms require a threshold to be defined.

Replication Task Notes:
- CDC throughput alarms (CDCThroughputRowsSource and CDCThroughputRowsTarget) default to detecting bulk operations (threshold: 1000 rows/sec, comparison: GREATER_THAN_THRESHOLD), but the comparisonOperator can be overridden to detect low throughput instead.
- CDC latency alarms (CDCLatencySource and CDCLatencyTarget) default to detecting high latency (threshold: 300 seconds, comparison: GREATER_THAN_THRESHOLD), which can indicate replication lag or database performance problems.
- Full load throughput alarms default to detecting low throughput during data migration.

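The DLQ rule in the SQS notes is easier to see as code. Below is a small, dependency-free TypeScript model of it. Every name in it (QueueNode, detectDlqs, alarmsFor) is hypothetical and only mirrors the behavior described above; it assumes a queue is recognized as a DLQ when some queue inside the Aspect's scope names it as its dead-letter queue. It is a sketch of the rule, not the library's implementation.

```typescript
// Hypothetical model of the SQS Aspect's DLQ rule; not the library's code.
interface QueueNode {
  id: string;
  deadLetterQueueId?: string; // ID of the queue this one redrives to, if any
}

type SqsAlarm =
  | 'ApproximateAgeOfOldestMessage'
  | 'ApproximateNumberOfMessagesNotVisible'
  | 'ApproximateNumberOfMessagesVisible'
  | 'NumberOfMessagesSent';

const FULL_SET: SqsAlarm[] = [
  'ApproximateAgeOfOldestMessage',
  'ApproximateNumberOfMessagesNotVisible',
  'ApproximateNumberOfMessagesVisible',
  'NumberOfMessagesSent',
];

// A queue is treated as a DLQ only if a queue *inside* the scope points to it.
// Main queues defined outside the scope are never visited, so their DLQs are missed.
function detectDlqs(queuesInScope: QueueNode[]): Set<string> {
  const dlqs = new Set<string>();
  for (const queue of queuesInScope) {
    if (queue.deadLetterQueueId !== undefined) {
      dlqs.add(queue.deadLetterQueueId);
    }
  }
  return dlqs;
}

function alarmsFor(queue: QueueNode, dlqs: Set<string>, dlqsGetFullRecommendedAlarms: boolean): SqsAlarm[] {
  if (dlqs.has(queue.id) && !dlqsGetFullRecommendedAlarms) {
    // Detected DLQs get only the visible-messages alarm (default threshold 0).
    return ['ApproximateNumberOfMessagesVisible'];
  }
  return [...FULL_SET];
}

// 'OrphanDlq' stands for the DLQ of a main queue defined outside the scope.
const mainQueue: QueueNode = { id: 'Main', deadLetterQueueId: 'MainDlq' };
const mainDlq: QueueNode = { id: 'MainDlq' };
const orphanDlq: QueueNode = { id: 'OrphanDlq' };

const dlqs = detectDlqs([mainQueue, mainDlq, orphanDlq]);
for (const queue of [mainQueue, mainDlq, orphanDlq]) {
  console.log(queue.id, alarmsFor(queue, dlqs, false));
}
```

In this model, OrphanDlq receives the full set of alarms because nothing in scope marks it as a DLQ, which is the situation the note warns about.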
Aspects

Below is an example of configuring the Lambda aspect. You must supply configuration for any alarm setting that has no default, which in most cases is only a threshold. Because the aspect is added at the app level, it applies to the Lambda functions in both TestStack and TestStack2 and creates all available recommended alarms for those functions. See the references for more details on Aspects, which can be applied to the app, a stack, or individual constructs depending on your use case.

import { App, Stack, Aspects, aws_lambda as lambda } from 'aws-cdk-lib';
import * as recommendedalarms from '@renovosolutions/cdk-library-cloudwatch-alarms';

const app = new App();
const stack = new Stack(app, 'TestStack', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const stack2 = new Stack(app, 'TestStack2', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const appAspects = Aspects.of(app);

appAspects.add(
  new recommendedalarms.LambdaRecommendedAlarmsAspect({
    configDurationAlarm: {
      threshold: 15,
    },
    configErrorsAlarm: {
      threshold: 1,
    },
    configThrottlesAlarm: {
      threshold: 0,
    },
  }),
);

new lambda.Function(stack, 'Lambda', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async (event) => { console.log(event); }'),
});

new lambda.Function(stack2, 'Lambda2', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async (event) => { console.log(event); }'),
});
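CDK invokes an aspect on every construct in the subtree of the scope it was added to, which is why the app-level aspect above reaches the functions in both stacks. The sketch below is a simplified, dependency-free model of that traversal, not the CDK implementation; TreeNode, applyAspect, and RecordingLambdaAspect are hypothetical names.

```typescript
// Simplified model of aspect traversal; names are hypothetical, not CDK's.
type Kind = 'App' | 'Stack' | 'Function';

class TreeNode {
  readonly id: string;
  readonly kind: Kind;
  readonly children: TreeNode[] = [];

  constructor(id: string, kind: Kind, parent?: TreeNode) {
    this.id = id;
    this.kind = kind;
    // Register with the parent, like a construct created inside a scope.
    if (parent) {
      parent.children.push(this);
    }
  }
}

interface Aspect {
  visit(node: TreeNode): void;
}

// Visit the scope itself, then every descendant, depth first.
function applyAspect(scope: TreeNode, aspect: Aspect): void {
  aspect.visit(scope);
  for (const child of scope.children) {
    applyAspect(child, aspect);
  }
}

// Stand-in for a Lambda alarms aspect: records the functions it would alarm on.
class RecordingLambdaAspect implements Aspect {
  readonly alarmed: string[] = [];

  visit(node: TreeNode): void {
    if (node.kind === 'Function') {
      this.alarmed.push(node.id);
    }
  }
}

// Same tree shape as the example above.
const app = new TreeNode('App', 'App');
const stack = new TreeNode('TestStack', 'Stack', app);
const stack2 = new TreeNode('TestStack2', 'Stack', app);
new TreeNode('Lambda', 'Function', stack);
new TreeNode('Lambda2', 'Function', stack2);

const appWide = new RecordingLambdaAspect();
applyAspect(app, appWide);

const stack2Only = new RecordingLambdaAspect();
applyAspect(stack2, stack2Only);

console.log(appWide.alarmed, stack2Only.alarmed);
```

Adding the aspect to a single stack instead of the app narrows it to that stack's subtree, so only Lambda2 is reached in the second case.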

Recommended Alarm Constructs

You can also apply alarms to a specific resource using the recommended alarms construct for that resource type. For example, if you have an S3 bucket, you could do something like the following. None of the S3 alarms require configuration, so no config props are needed in this case:

import { App, Stack, aws_s3 as s3 } from 'aws-cdk-lib';
import * as recommendedalarms from '@renovosolutions/cdk-library-cloudwatch-alarms';

const app = new App();
const stack = new Stack(app, 'TestStack', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const bucket = new s3.Bucket(stack, 'Bucket', {});

new recommendedalarms.S3RecommendedAlarms(stack, 'RecommendedAlarms', {
  bucket,
});

Individual Constructs

You can also apply specific alarms from their individual constructs:

import { App, Stack, aws_s3 as s3 } from 'aws-cdk-lib';
import * as recommendedalarms from '@renovosolutions/cdk-library-cloudwatch-alarms';

const app = new App();
const stack = new Stack(app, 'TestStack', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const bucket = new s3.Bucket(stack, 'Bucket', {});

new recommendedalarms.S3Bucket5xxErrorsAlarm(stack, 'RecommendedAlarms', {
  bucket,
  threshold: 0.10,
});

Construct Extensions

You can use extended versions of the constructs you are familiar with to expose helper methods for alarms if you'd like to keep alarms more tightly coupled to specific resources.

import { App, Stack, aws_s3 as s3 } from 'aws-cdk-lib';
import * as recommendedalarms from '@renovosolutions/cdk-library-cloudwatch-alarms';

const app = new App();
const stack = new Stack(app, 'TestStack', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const bucket = new recommendedalarms.Bucket(stack, 'Bucket', {});

bucket.applyRecommendedAlarms();

Alarm Actions

You can set alarm actions as defaults on an aspect or on an all-recommended-alarms construct, or you can set them per alarm through the helper methods of individual constructs. In the example below, default actions are set on the aspect, and the Duration alarm overrides its alarm action to use a different SNS topic.

import { App, Stack, Aspects, aws_lambda as lambda, aws_sns as sns, aws_cloudwatch_actions as cloudwatch_actions } from 'aws-cdk-lib';
import * as recommendedalarms from '@renovosolutions/cdk-library-cloudwatch-alarms';

const app = new App();
const stack = new Stack(app, 'TestStack', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const stack2 = new Stack(app, 'TestStack2', {
  env: {
    account: '123456789012',
    region: 'us-east-1',
  },
});

const alarmTopic = new sns.Topic(stack, 'Topic');
const topicAction = new cloudwatch_actions.SnsAction(alarmTopic);

// Construct IDs must be unique within a scope, so the second topic needs its own ID
const alarmTopic2 = new sns.Topic(stack, 'Topic2');
const topicAction2 = new cloudwatch_actions.SnsAction(alarmTopic2);

const appAspects = Aspects.of(app);

appAspects.add(
  new recommendedalarms.LambdaRecommendedAlarmsAspect({
    defaultAlarmAction: topicAction,
    defaultOkAction: topicAction,
    defaultInsufficientDataAction: topicAction,
    configDurationAlarm: {
      threshold: 15,
      alarmAction: topicAction2,
    },
    configErrorsAlarm: {
      threshold: 1,
    },
    configThrottlesAlarm: {
      threshold: 0,
    },
  }),
);

new lambda.Function(stack, 'Lambda', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async (event) => { console.log(event); }'),
});

new lambda.Function(stack2, 'Lambda2', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async (event) => { console.log(event); }'),
});
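One way to read the example above is as a fallback rule: an alarm uses its own alarmAction when one is configured and the aspect's default otherwise. The sketch below models only that rule, with plain strings standing in for SnsAction objects. resolveActions and its interfaces are hypothetical, and it assumes an override replaces the default instead of being added alongside it.

```typescript
// Hypothetical model of default-vs-override resolution; not the library's code.
type Action = string; // stands in for an IAlarmAction such as an SnsAction

interface Defaults {
  defaultAlarmAction?: Action;
  defaultOkAction?: Action;
  defaultInsufficientDataAction?: Action;
}

// Only alarmAction is overridden in the example above, so only it is modeled here.
interface AlarmConfig {
  threshold: number;
  alarmAction?: Action;
}

interface ResolvedActions {
  alarm?: Action;
  ok?: Action;
  insufficientData?: Action;
}

// Assumption: a per-alarm action replaces the default instead of being added to it.
function resolveActions(config: AlarmConfig, defaults: Defaults): ResolvedActions {
  return {
    alarm: config.alarmAction ?? defaults.defaultAlarmAction,
    ok: defaults.defaultOkAction,
    insufficientData: defaults.defaultInsufficientDataAction,
  };
}

// Mirrors the aspect props above: every default points at 'Topic'.
const defaults: Defaults = {
  defaultAlarmAction: 'Topic',
  defaultOkAction: 'Topic',
  defaultInsufficientDataAction: 'Topic',
};

const durationActions = resolveActions({ threshold: 15, alarmAction: 'Topic2' }, defaults);
const errorsActions = resolveActions({ threshold: 1 }, defaults);
console.log(durationActions, errorsActions);
```

Under this reading, the Duration alarm notifies Topic2 when it fires but still sends OK and insufficient-data notifications to Topic, while the Errors and Throttles alarms use Topic for everything.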

Exclusions

You can exclude specific alarms or specific resources. Alarms are identified by the available metric enums, and resources by the string used as the resource's construct ID. In the example below, no alarms are created for Lambda1, and no Duration alarm is created for either Lambda function.

import { App, Stack, Aspects, aws_lambda as lambda } from 'aws-cdk-lib';
import * as recommendedalarms from '@renovosolutions/cdk-library-cloudwatch-alarms';

const app = new App();
const stack = new Stack(app, 'TestStack', {
  env: {
    account: '123456789012', // not a real account
    region: 'us-east-1',
  },
});

const appAspects = Aspects.of(app);

appAspects.add(
  new recommendedalarms.LambdaRecommendedAlarmsAspect({
    excludeResources: ['Lambda1'],
    excludeAlarms: [recommendedalarms.LambdaRecommendedAlarmsMetrics.DURATION],
    configDurationAlarm: {
      threshold: 15,
    },
    configErrorsAlarm: {
      threshold: 1,
    },
    configThrottlesAlarm: {
      threshold: 0,
    },
  }),
);

new lambda.Function(stack, 'Lambda1', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async (event) => { console.log(event); }'),
});

new lambda.Function(stack, 'Lambda2', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async (event) => { console.log(event); }'),
});
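The two exclusion rules can be thought of as two set lookups: one on construct IDs and one on metric values. The sketch below is hypothetical: LambdaMetric is a stand-in for LambdaRecommendedAlarmsMetrics with invented string values, its members follow the implemented Lambda alarms in the coverage table, and plannedAlarms is not a library function.

```typescript
// Hypothetical model of aspect exclusions; not the library's implementation.
// LambdaMetric stands in for LambdaRecommendedAlarmsMetrics; its values are invented.
const LambdaMetric = {
  ERRORS: 'Errors',
  THROTTLES: 'Throttles',
  DURATION: 'Duration',
  CONCURRENT_EXECUTIONS: 'ConcurrentExecutions',
} as const;
type LambdaMetric = (typeof LambdaMetric)[keyof typeof LambdaMetric];

interface ExclusionProps {
  excludeResources?: string[]; // construct IDs
  excludeAlarms?: LambdaMetric[];
}

// Returns, per function ID, the alarms that would be created after exclusions.
function plannedAlarms(functionIds: string[], props: ExclusionProps): Map<string, LambdaMetric[]> {
  const excludedIds = new Set<string>(props.excludeResources ?? []);
  const excludedMetrics = new Set<LambdaMetric>(props.excludeAlarms ?? []);
  const plan = new Map<string, LambdaMetric[]>();
  for (const id of functionIds) {
    if (excludedIds.has(id)) {
      continue; // excluded resource: no alarms at all
    }
    const metrics = Object.values(LambdaMetric).filter((metric) => !excludedMetrics.has(metric));
    plan.set(id, metrics);
  }
  return plan;
}

// Same exclusions as the aspect example above.
const plan = plannedAlarms(['Lambda1', 'Lambda2'], {
  excludeResources: ['Lambda1'],
  excludeAlarms: [LambdaMetric.DURATION],
});
console.log(plan);
```

Excluding a resource short-circuits before any metric filtering, so an excluded function gets no alarms regardless of which metrics are excluded.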

References
