Why
In superwerker, a project I used to contribute to, we built the so-called rootmail feature (see its ADR). In a nutshell:
Each AWS account needs one unique email address (the so-called “AWS account root user email address”).
Access to these email addresses must be adequately secured since they provide privileged access to AWS accounts, such as account deletion procedures.
This is why you only need one mailing list for the AWS Management (formerly root) account. We recommend aws-roots+<uuid>@mycompany.test
NOTE: a maximum of 64 characters is allowed for the whole address. As you own the domain mycompany.test, you can add a subdomain, e.g. aws, for which all emails will then be received with this solution within this particular AWS Management account.
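The 64-character limit from the note is easy to check in code. The following is a minimal sketch, not the rootmail implementation; the helper name buildRootEmail and its parameters are hypothetical.

```typescript
// Maximum length of the whole address, as stated in the note above.
const MAX_ROOT_EMAIL_LENGTH = 64;

// Hypothetical helper: builds "<prefix>+<tag>@[<subdomain>.]<domain>" and
// rejects addresses that exceed the length limit.
function buildRootEmail(prefix: string, tag: string, domain: string, subdomain?: string): string {
  const host = subdomain ? `${subdomain}.${domain}` : domain;
  const address = `${prefix}+${tag}@${host}`;
  if (address.length > MAX_ROOT_EMAIL_LENGTH) {
    throw new Error(`address has ${address.length} characters, maximum is ${MAX_ROOT_EMAIL_LENGTH}`);
  }
  return address;
}

// A UUID has 36 characters, so "aws-roots+<uuid>@" already uses 47 of the 64.
const uuid = '123e4567-e89b-12d3-a456-426614174000';
console.log(buildRootEmail('aws-roots', uuid, 'mycompany.test'));

// A shorter tag leaves room for a subdomain.
console.log(buildRootEmail('aws-roots', 'test-id-1', 'mycompany.test', 'aws'));
```

With the aws-roots prefix and a full UUID, 17 characters remain for everything after the @ sign, so it is worth validating the final address whenever a subdomain is added.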
The feature was only available in CloudFormation, and I wanted to migrate it to cdk to learn more about it and to make it available to the community. So let’s start, and I’ll share the journey with you.
The challenge
After having completed several cdk courses (e.g. this one, or the official one), I wanted to apply my knowledge and build a construct according to best practices.
The solution
First, let’s take a look at the final solution’s architecture, and then let’s dive into how I got there.
Local testing
First without the integ test, which I’ll explain later.
- Create a separate project
- Add the dependency in the .projenrc.ts file and run npm run projen to import it
import { awscdk } from 'projen';
const project = new awscdk.AwsCdkTypeScriptApp({
  // other settings
  deps: [
    'file:../awscdk-rootmail',
  ],
});
However, I got the error Types have separate declarations of a private property 'host'. ts(2345) for the this parameter:
export class MyStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps = {}) {
    super(scope, id, props);
    // this param 👇 caused the issue
    new Rootmail(this, 'testRootmail', {
      domain: 'mavogel.xyz', // my testing domain 😊
    });
  }
}
After some research I found out that the cdk team is aware of that issue. I also found someone asking a similar question in the cdk Slack, and Matthew Bonig wrote a blog post about it as well. This did not work for me, so I came up with a simpler solution:
$ rm -rf node_modules/awscdk-rootmail/node_modules/constructs/
This removed the duplicate occurrence of the constructs module in the tree, and I was then able to run npm run deploy successfully.
Note: you might be wondering why I am not using the cdk integ-tests module to test my construct. Be patient, I switched to it later on.
Back to the initial attempt, the next error occurred:
Cannot find index file at awscdk-rootmail-test/node_modules/awscdk-rootmail/lib/functions/hosted_zone_dkim_verification_records_cr/index.py
Ok then I thought, let’s migrate the Lambda functions to TypeScript as well.
Rewriting the Lambda functions to TypeScript
I mainly followed Philipp Garbe’s post on how he develops Lambda functions in TypeScript, and also reused certain baselines from the superwerker project, such as the custom resource in generate-email-address.ts.
I first kept the converted Lambda functions in the functions folder; however, they could not be found in the awscdk-rootmail-test/node_modules/awscdk-rootmail/lib/functions folder. So I moved them into a flat folder structure, as the following image shows, with the known naming conventions.
However, when trying to deploy, I encountered the next error (see "Cannot find Package" errors when I run Lambda code in Node.js?):
{
  "errorType": "Runtime.ImportModuleError",
  "errorMessage": "Error: Cannot find module 'aws-sdk'\nRequire stack:\n- /var/task/index.js\n- /var/runtime/index.mjs",
  "stack": [
    "Runtime.ImportModuleError: Error: Cannot find module 'aws-sdk'",
    "Require stack:",
    "- /var/task/index.js",
    "- /var/runtime/index.mjs",
    " at _loadUserApp (file:///var/runtime/index.mjs:997:17)",
    " at async UserFunction.js.module.exports.load (file:///var/runtime/index.mjs:1029:21)",
    " at async start (file:///var/runtime/index.mjs:1192:23)",
    " at async file:///var/runtime/index.mjs:1198:1"
  ]
}
After some research, an AWS re:Post article gave me more clarity, stating:
For Node.js runtimes 16 and earlier, Lambda doesn’t support layered JavaScript ES module dependencies. You must include the dependencies in the deployment. Lambda supports JavaScript ES module dependencies for Node.js 18.
So, I switched to the Node.js 18 runtime for all Lambda functions:
const rootMailReady = new NodejsFunction(this, 'ready-handler', {
  // 👇 solved the layer issue with the 'aws-sdk' dependency
  runtime: lambda.Runtime.NODEJS_18_X,
  environment: {
    DOMAIN: domain,
    SUB_DOMAIN: subdomain,
  },
});
Now it finally worked, and I could verify it by sending mails from my Gmail address. However, the next step was to have it tested with the cdk integ tests.
Adding cdk integ tests
As mentioned before, I wanted proper testing and to get away from the local awscdk-rootmail-test project. All the tests are based on the previously existing rootmail_test.py file from the superwerker project, so I started mapping them to TypeScript.
NOTE: the @aws-cdk/integ-tests-alpha package is still in alpha state, so I did expect certain things not to work as expected. However, I was able to get it working and I am happy to share my findings with you.
Baseline
My challenges were:
- No auto-completion for the awsApiCall method. Going to the docs of aws-sdk-js was the only way to find out what the parameters are, as I don’t know them all by memory 😅 Ok, this was basic RTFM of the SDK docs.
const getHostedZoneParametersAssertion = integ.assertions
  /**
   * Check that parameter is present
   */
  .awsApiCall('SSM', 'getParameter', {
    Name: rootmail.hostedZoneParameterName,
  })
  .expect(
    ExpectedResult.objectLike({
      Parameter: {
        Name: rootmail.hostedZoneParameterName,
        Type: 'StringList',
      },
    }),
  );
- I could not implement all the cases with the awsApiCall method, especially those with multiple aws-sdk calls that pass data from one to another. So I needed a more flexible option: how do I deploy a Lambda function and invoke it in tests? I summarized it in an issue of the great cdk-integ-tests-sample GitHub repository, and also looked into the official aws-cdk test suites, e.g. the one for api-gateway. The following is a snippet from the integ.rootmail.ts test file. You will find the whole code in the linked project at the end of the post.
const closeOpsItemHandler = new NodejsFunction(stackUnderTest, 'close-opsitem-handler', {
  entry: path.join(__dirname, 'functions', 'close-opsitem-handler.ts'),
  runtime: lambda.Runtime.NODEJS_18_X,
  logRetention: 1,
  timeout: Duration.seconds(30),
  initialPolicy: [
    // policies 👇 go here
    new iam.PolicyStatement({
      actions: [
        'ssm:GetOpsSummary',
        'ssm:UpdateOpsItem',
      ],
      resources: ['*'],
    }),
  ],
});
// ...
const updateOpsItemAssertion = integ.assertions
  .invokeFunction({
    functionName: closeOpsItemHandler.functionName,
    // to be able to 👇 debug
    logType: LogType.TAIL,
    // to run it synchronously ----- 👇
    invocationType: InvocationType.REQUEST_RESPONE,
    // found this 👇 in the aws-cdk test suite for api-gateway
    payload: JSON.stringify({
      title: id,
    }),
  }).expect(ExpectedResult.objectLike(
    // as the object 'return { closeStatusCode: 200 };'
    // is wrapped in a Payload object with other properties 🙃
    {
      Payload: {
        closeStatusCode: 200,
      },
    },
  ),
  );
- The email sending assertion to kick the whole test off was tricky, as I received the following error message:
Received response status [FAILED] from custom resource. Message returned: Email address is not verified. The following identities failed the check in region EU-CENTRAL-1: test@aws-test.mavogel.xyz, root+test-id-1@aws-test.mavogel.xyz
I saw two possible solutions:
- either get SES out of sandbox into production mode (in the testing AWS account, as receiving works)
- or add a verified email address OR domain (see stackoverflow), which was the case for eu-west-1
In the initial solution design, the SES email receiver is in eu-west-1, and the error message shows that the identities were checked in eu-central-1, so I fixed this in the assertion function by initializing the SDK accordingly:
const SES = new AWS.SES({ region: 'eu-west-1' });
export const handler = async (event: any) => {
  // ...
}
- The next challenge was how to autowire the DNS setup for testing. I found the following constraint for least-privilege IAM policies for Route53 (excerpt from ChatGPT):
AWS IAM does not support granular permissions down to the resource record set level within Route53. This means you cannot restrict the ChangeResourceRecordSets permission to only apply to specific record sets (like NS records or certain domain names).
IAM policies can limit permissions to specific hosted zones via the resource ARN, but they cannot get more specific than that within Route53. This is primarily due to the fact that DNS record sets (like NS, A, AAAA, CNAME, etc.) aren’t individual resources with their own ARNs.
If you want to limit the impact of a given IAM role that can change DNS records, you might have to take an application-side approach, such as implementing the checks in the application code itself, in the AWS Lambda function in this case, to prevent changes to other record types or domains.
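The application-side approach mentioned in the excerpt could look roughly like the following. This is a minimal sketch under assumptions: the RecordChange shape is simplified and the function names are hypothetical, not taken from the awscdk-rootmail code.

```typescript
// Simplified shape of a DNS record change; only the fields needed for the check.
interface RecordChange {
  action: 'CREATE' | 'UPSERT' | 'DELETE';
  name: string; // fully qualified record name, e.g. "aws.mycompany.test."
  type: string; // e.g. "NS", "A", "TXT"
}

// Hypothetical guard: allow only NS records for exactly the delegated subdomain.
function isAllowedChange(change: RecordChange, allowedName: string): boolean {
  // Compare case-insensitively and ignore the trailing dot of fully qualified names.
  const normalize = (n: string): string => n.toLowerCase().replace(/\.$/, '');
  return change.type === 'NS' && normalize(change.name) === normalize(allowedName);
}

// Throws before any API call is made if at least one change is outside the allowed scope.
function assertAllowedChanges(changes: RecordChange[], allowedName: string): void {
  const rejected = changes.filter((c) => !isAllowedChange(c, allowedName));
  if (rejected.length > 0) {
    const details = rejected.map((c) => `${c.type} ${c.name}`).join(', ');
    throw new Error(`refusing to change records: ${details}`);
  }
}
```

Calling assertAllowedChanges at the top of the Lambda handler keeps the check in one place, so the function can keep a hosted-zone-wide IAM permission while still refusing unrelated record changes.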
Ok, the Lambda function was straightforward, but I encountered the following issue: although the SES receipt rule to S3 and the Lambda function were wired correctly, the email was neither delivered to the S3 bucket nor was the Lambda function invoked. I deployed a separate Lambda function manually and it worked as expected, so I was confused. Looking at the S3 bucket for the mails during my manual testing, I realized that there were 1m30s between the SES_SETUP_NOTIFICATION_MAIL and the actual processed mail. So I added a configurable sleep (because we do not want this delay in the unit test) with a default of 2 minutes. And voilà, it worked. So it was a race condition in the SES receipt rule setup. It took me 1 day to find out 😅 However, I learned a lot, especially about debugging.
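A configurable wait like the one described could be sketched as follows. The helper names and the idea of passing the setting as a string (for example from an environment variable) are assumptions for illustration, not the actual implementation.

```typescript
// 2 minutes, the default mentioned above.
const DEFAULT_WAIT_MS = 2 * 60 * 1000;

// Hypothetical helper: parses the configured wait time in milliseconds and
// falls back to the default for missing or invalid values.
function resolveWaitMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') {
    return DEFAULT_WAIT_MS;
  }
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_WAIT_MS;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Waits before the test mail is sent; unit tests can pass '0' to skip the delay.
async function waitForReceiptRule(raw: string | undefined): Promise<number> {
  const ms = resolveWaitMs(raw);
  if (ms > 0) {
    await sleep(ms);
  }
  return ms;
}
```

Keeping the parsing separate from the sleeping makes the default easy to unit test without actually waiting.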
More obstacles
There was a race condition in the integ test: I realized that the cleanupAssertion function was called sequentially, however both when the stack was created AND when it was updated. This meant that resources were cleaned up in the middle of the test run. I understood this once I realized how the integ-runner works.
// check the parameter store
getHostedZoneParametersAssertion
  // Send a test email
  .next(sendTestEmailAssertion)
  // Validate an OPS item was created.
  .next(validateOpsItemAssertion) // <-
  // Close the OPS item that was created.
  .next(updateOpsItemAssertion) // <-
  // call teardown 👇 lambda
  .next(cleanupAssertion);
For now, I implemented the logic so that all CRs clean up after themselves. For the S3 bucket containing the mails, I have a separate Python script which needs to be run manually. As a new bucket with a random suffix is created per test run, it is fine for now to clean it up manually as follows:
import boto3
import sys

def empty_and_delete_bucket(bucket_name):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    # Empty the bucket
    print(f"Emptying bucket: {bucket_name}")
    for obj_version in bucket.object_versions.all():
        obj_version.delete()
    # Delete the bucket
    bucket.delete()
    print(f"Bucket {bucket_name} deleted")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Please provide the bucket name as a parameter.")
        sys.exit(1)
    bucket_name = sys.argv[1]
    empty_and_delete_bucket(bucket_name)
As of now, I could not use invokeFunction().waitForAssertions() to poll a Lambda function, as the following error shows:
2023-09-07T16:12:00.039Z 92341375-fb62-4589-82e8-b0802ea4102c INFO AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/SetupTestDefaultTestDeplo-SingletonFunction76b3e83-MF49XEZ4HA0J/SetupTestDefaultTestDeplo-SingletonFunction76b3e83-6VsSOMHQs31T is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:eu-west-1:123456789012:function:RootmailTestStack-closeopsitemhandler2F03D32C-U06t2LsB3GQR because no identity-based policy allows the lambda:InvokeFunction action
So I had to implement it myself. I am not sure if this is the best way to do it, but it works for now!
Finally
After all the testing was done, I added documentation and approached Thorsten Höger from taimos for a cdk-app-review to learn from the best, on what I could improve. For me, he is one of the best cdk developers I know and he is also a great and patient teacher. He gave me a lot of valuable feedback, which I will incorporate in the next steps.
The cdk-app-review
First things first: I decided to learn from the experts. Thorsten Höger is one of the best cdk developers I know, and he is also a great teacher. So I highly recommend getting your cdk project reviewed by him.
Before the review
I used PhysicalName.GENERATE_IF_NEEDED for certain resource names, which was associated with issues such as those discussed here.
Parameters were employed only as indexed objects. For further insights on how we managed TypeScript Maps, you may refer to this blog post.
Originally, I used integrated stacksets, transitioning from this discussion to utilizing cdk-stacksets. A notable advantage of using CDK native features is that they inherently possess multi-account and region capabilities, similar to Terraform.
One major piece of feedback was that the code in the rootmail.ts file was too complex. However, I wanted to stick to the original implementation first and refactor it afterwards, so I did not change it before the review.
- The rootMailReady function checks for a maximum of 260s (4m20s) whether the SES setup is done and the DNS is wired:
const rootMailReady = new NodejsFunction(this, 'ready-handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  // the timeout effectively limits retries to 2^(n+1) - 1 = 9 attempts with backoff
  // as the function is called every 5 minutes from the event rule
  timeout: Duration.seconds(260),
  logRetention: 3,
  environment: {
    DOMAIN: domain,
    SUB_DOMAIN: subdomain,
  },
});
// more code
this.rootMailReadyEventRule = new events.Rule(this, 'RootMailReadyEventRule', {
  schedule: events.Schedule.rate(Duration.minutes(5)),
});
- If it does not run into the timeout, it sets a CloudWatch alarm to green (OK):
const rootMailReadyAlert = new cw.Alarm(this, 'Errors', {
  alarmName: 'superwerker-RootMailReady',
  comparisonOperator: cw.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  metric: new cw.Metric({
    namespace: 'AWS/Lambda',
    metricName: 'Errors',
    period: Duration.seconds(180),
    statistic: 'Sum',
    dimensionsMap: {
      // see the function name 👇
      FunctionName: rootMailReady.functionName,
    },
  }),
  evaluationPeriods: 1,
  threshold: 1,
});
- which then triggers the rootMailReadyTrigger function
const rootMailReadyTriggerEventPattern = new events.Rule(this, 'RootMailReadyTriggerEventPattern', {
  eventPattern: {
    detailType: ['CloudWatch Alarm State Change'],
    source: ['aws.cloudwatch'],
    detail: {
      alarmName: [rootMailReadyAlert.alarmName],
      state: {
        value: ['OK'],
      },
    },
  },
});
rootMailReadyTriggerEventPattern.addTarget(new LambdaFunction(rootMailReadyTrigger));
- which then triggers the rootMailReadyHandle wait condition for the stack
const rootMailReadyHandle = new CfnWaitConditionHandle(this, 'RootMailReadyHandle');
new CfnWaitCondition(this, 'RootMailReadyHandleWaitCondition', {
  handle: rootMailReadyHandle.ref,
  timeout: totalTimeToWireDNS.toSeconds().toString(),
});
const rootMailReadyTrigger = new NodejsFunction(this, 'ready-trigger-handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  timeout: Duration.seconds(10),
  logRetention: 3,
  environment: {
    // HTTP POST URL to trigger 👇 the wait condition
    SIGNAL_URL: rootMailReadyHandle.ref,
    ROOTMAIL_READY_EVENTRULE_NAME: this.rootMailReadyEventRule.ruleName,
    AUTOWIRE_DNS_EVENTRULE_NAME: autowireDNSEventRuleName,
  },
});
I have no clue why it was built like this back then, but it works 😅 and I will refactor it in the next steps. Rebuilding it in cdk was a slight pain, but I learned a lot about the cdk and AWS ecosystem, which was fine.
After the review
I opted to generate S3 bucket names as everything is now consolidated into a single stack, enhancing manageability and consistency.
I adopted grants because they seamlessly manage permissions, such as automatically handling KMS key permissions when required, thereby streamlining access management.
I shifted to using the ?? (nullish coalescing) operator instead of || to ensure that only null or undefined values trigger the use of a default value, making the conditionals more robust and accurate.
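The difference between the two operators shows up as soon as a falsy value such as 0 is a legitimate setting. The following sketch uses a hypothetical waitSeconds property for illustration; it is not taken from the construct's props.

```typescript
interface WaitProps {
  waitSeconds?: number; // hypothetical optional setting
}

const DEFAULT_WAIT_SECONDS = 120;

// With ||, every falsy value (0, NaN, '', false) is replaced by the default.
function waitWithOr(props: WaitProps): number {
  return props.waitSeconds || DEFAULT_WAIT_SECONDS;
}

// With ??, only null and undefined are replaced by the default.
function waitWithNullish(props: WaitProps): number {
  return props.waitSeconds ?? DEFAULT_WAIT_SECONDS;
}

console.log(waitWithOr({ waitSeconds: 0 }));      // 120, the explicit 0 is lost
console.log(waitWithNullish({ waitSeconds: 0 })); // 0, the explicit 0 is kept
console.log(waitWithNullish({}));                 // 120, undefined falls back
```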
Instead of manually passing the Route53 hosted zone ID as a parameter, I now retrieve it via lookup, enhancing automation and reducing the potential for human error such as typos.
In order to achieve a more streamlined logic and to harness AWS CDK’s capabilities more effectively, I adopted isCompleteHandlers in AWS CDK custom resources. This was the biggest change in the codebase, as it required a complete rewrite of the custom resource logic. The following code snippet illustrates the implementation of isCompleteHandlers:
const route53 = new Route53();
const ssm = new SSM();
export interface IsCompleteHandlerResponse {
  IsComplete: boolean;
}
export async function handler(event: AWSCDKAsyncCustomResource.OnEventRequest): Promise<IsCompleteHandlerResponse> {
  const hostedZoneParameterName = event.ResourceProperties[PROP_R53_HANGEINFO_ID_PARAMETER_NAME];
  const recordSetCreationResponseChangeInfoIdParam = await ssm.getParameter({
    Name: hostedZoneParameterName,
  }).promise();
  const recordSetCreationResponseChangeInfoId = recordSetCreationResponseChangeInfoIdParam.Parameter?.Value as string;
  log(`got R53 change info id: ${recordSetCreationResponseChangeInfoId} for event type ${event.RequestType}`);
  log({
    msg: 'event',
    event,
  });
  switch (event.RequestType) {
    case 'Create':
      log('waiting for DNS to propagate');
      try {
        // we use the waiter, however with a small delay and only 1 attempt
        // as the polling logic is handled by the CR itself
        const res = await route53.waitFor('resourceRecordSetsChanged', {
          Id: recordSetCreationResponseChangeInfoId,
          // Note: the default is 30s delay and 60 attempts
          $waiter: {
            delay: 2,
            maxAttempts: 1,
          },
        }).promise();
        if (res.ChangeInfo.Status !== 'INSYNC') {
          log(`DNS propagation not in sync yet. Has status ${res.ChangeInfo.Status}`);
          return { IsComplete: false };
        }
        log(`DNS propagated with status ${res.ChangeInfo.Status}`);
        return { IsComplete: true };
      } catch (e) {
        log(`DNS propagation errored. Has message ${e}`);
        return { IsComplete: false };
      }
    case 'Update':
    case 'Delete':
      return {
        IsComplete: true,
      };
  }
}
function log(msg: any) {
  console.log(JSON.stringify(msg));
}
This handler is then plugged into the provider and called iteratively. This simplified the whole polling, which before was wrapped in complex polling-with-backoff logic, called by an event every 5 minutes from EventBridge. Now it is all handled by the isCompleteHandler and onEventHandler of the Provider class. The following code snippet illustrates the implementation:
this.provider = new cr.Provider(this, 'rootmail-autowire-dns-provider', {
  isCompleteHandler: isCompleteHandlerFunc,
  queryInterval: Duration.seconds(5),
  totalTimeout: Duration.minutes(20),
  onEventHandler: onEventHandlerFunc,
});
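To get a feeling for what these two settings mean for the handler, the polling contract can be modeled in a few lines. This is a simplified, synchronous sketch that assumes a fixed interval between calls; it is not the Provider's actual implementation, and simulatePolling is a hypothetical name.

```typescript
interface PollResult {
  complete: boolean;
  attempts: number;
}

// Calls isComplete once per interval until it reports completion
// or the attempt budget derived from the total timeout is used up.
function simulatePolling(
  isComplete: (attempt: number) => boolean,
  queryIntervalSeconds: number,
  totalTimeoutSeconds: number,
): PollResult {
  const maxAttempts = Math.floor(totalTimeoutSeconds / queryIntervalSeconds);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (isComplete(attempt)) {
      return { complete: true, attempts: attempt };
    }
  }
  return { complete: false, attempts: maxAttempts };
}

// Values from the snippet above: query every 5 seconds, give up after 20 minutes.
// Here the check succeeds on the third call.
const pollResult = simulatePolling((attempt) => attempt >= 3, 5, 20 * 60);
console.log(pollResult); // { complete: true, attempts: 3 }
```

Under the fixed-interval assumption, 20 minutes at 5 seconds gives a budget of 240 calls, which is why the handler itself only needs a single, short check per invocation.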
I also adopted cdk-nag to ensure that the code adheres to AWS best practices and security guidelines. This was achieved by adding cdk-nag as a dev dependency and running npx cdk-nag in the root directory of the project.
Conclusion
Diving into AWS, my journey from terraform to gaining proficiency in CDK has been a notable learning curve, enriched by first-time development experiences with ChatGPT and GitHub Copilot. Handling integration tests in particular pushed me to delve deep into design and effective testing methodologies, despite facing several hurdles with cdk-nag suppressions. A nod to Superluminar for laying the groundwork on Custom Resources (CR) in TypeScript: your efforts have helped this project significantly.
I highly recommend the cdk-app-review from Thorsten Höger. It provides substantial insights and could be a game-changer for your CDK projects. I also recommend the ChatGPT workshops from Cristian Măgherușan-Stanciu, which helped me a lot to use the right prompts for the chatbot. And not to forget Matthew Bonig’s advanced cdk course.
Still on the to-do list:
- Implementing it in GitHub Actions, following this guide. 🛑
- Migrating to aws-sdk-js-v3 ✅
- Figuring out a method to run the test stack exclusively for faster feedback and not using Lambda log debugging. ✅
- Fixing the cdk-nag errors and warnings and/or adding appropriate suppressions. ✅
This endeavor was not just a technical deep dive, but also a glimpse into the continually evolving AWS landscape, emphasizing the imperative of continuous learning and adaptation in cloud computing and development.