The awscdk-rootmail construct

Why

Formerly, in superwerker (which I contributed to), we built the so-called rootmail feature (see its ADR). In a nutshell:

Each AWS account needs one unique email address (the so-called “AWS account root user email address”).

Access to these email addresses must be adequately secured since they provide privileged access to AWS accounts, such as account deletion procedures.

This is why you only need one mailing list for the AWS Management (formerly root) account: we recommend aws-roots+<uuid>@mycompany.test

NOTE: a maximum of 64 characters is allowed for the whole address. And as you own the domain mycompany.test, you can add a subdomain, e.g. aws, for which all emails will then be received by this solution within this particular AWS Management account.
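To make the 64-character limit concrete, here is a minimal sketch of how such an address could be assembled and checked. The helper name buildRootEmail and its parameters are hypothetical and not part of the construct's API; only the aws-roots+<uuid> scheme and the limit come from the text above.

```typescript
// Hypothetical helper, not part of awscdk-rootmail: builds an address of the
// form aws-roots+<id>@<subdomain>.<domain> and enforces the 64-character limit.
const MAX_ROOT_EMAIL_LENGTH = 64;

function buildRootEmail(domain: string, subdomain: string, id: string): string {
  const address = `aws-roots+${id}@${subdomain}.${domain}`;
  if (address.length > MAX_ROOT_EMAIL_LENGTH) {
    throw new Error(
      `address is ${address.length} characters long, the limit is ${MAX_ROOT_EMAIL_LENGTH}`,
    );
  }
  return address;
}
```

Note that with the example subdomain aws.mycompany.test, a full 36-character UUID already adds up to 65 characters, so this sketch would reject it. A shorter id or prefix keeps the address within the limit.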

It was only available as CloudFormation, and I wanted to migrate it to cdk to learn more about cdk and to make the feature available to the community. So let’s start, and I’ll share the journey with you.

The challenge

After having completed several cdk courses (e.g. this one, or the official one), I wanted to apply my knowledge and build a construct according to best practices.

The solution

First, let’s take a look at the final solution’s architecture, and then let’s dive into how I got there.

Local testing

First without the integ test, which I’ll explain later.

  1. Create a separate project
  2. Add the dependency in the .projenrc.ts file and run npm run projen to import it
import { awscdk } from 'projen';
const project = new awscdk.AwsCdkTypeScriptApp({
  // other settings
  deps: [
    'file:../awscdk-rootmail', 
  ],
});

However, I got the error “Types have separate declarations of a private property 'host'. ts(2345)” for the this argument in the following code:

export class MyStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps = {}) {
    super(scope, id, props);

    // this param 👇 caused the issue
    new Rootmail(this, 'testRootmail', {
      domain: 'mavogel.xyz', // my testing domain 😊
    });
  }
}

After some research I found out that the cdk team is aware of that issue. I also found someone asking a similar question in the cdk Slack, and Matthew Bonig wrote a blog post about it as well. That did not work for me, so I came up with a simpler solution:

$ rm -rf node_modules/awscdk-rootmail/node_modules/constructs/

This removed the duplicate occurrence of the constructs module in the dependency tree, and I was then able to run npm run deploy successfully.

Note: you might be wondering why I am not using the cdk integ-tests module to test the construct. Be patient, I switched to it later on.

Back to the initial attempt, the next error occurred:

Cannot find index file at awscdk-rootmail-test/node_modules/awscdk-rootmail/lib/functions/hosted_zone_dkim_verification_records_cr/index.py

Ok then I thought, let’s migrate the Lambda functions to TypeScript as well.

Rewriting the Lambda functions to TypeScript

I mainly followed Philipp Garbe’s post on how he develops Lambda functions in TypeScript, and also took certain baselines from the superwerker project, like the custom resource in generate-email-address.ts.

I first kept the converted Lambda functions in the functions folder; however, they could not be found in the awscdk-rootmail-test/node_modules/awscdk-rootmail/lib/functions folder. So I moved them into a flat folder structure, as the following image shows, using the known naming conventions.
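As far as I understand the NodejsFunction documentation, the naming convention is that, when no entry is given, the handler file is looked up next to the file that defines the construct, as <defining-file>.<construct-id>.ts (or .js). The helper below is purely illustrative: its name and the example file names are assumptions, and it only mimics that lookup rule for the .ts case.

```typescript
// Illustrative only: mimics how NodejsFunction derives the default entry file
// when no `entry` prop is given. This is not the aws-cdk-lib implementation.
function defaultEntryFor(definingFile: string, constructId: string): string {
  // strip the extension of the file that defines the construct
  const base = definingFile.replace(/\.(ts|js)$/, '');
  return `${base}.${constructId}.ts`;
}
```

For a construct with the id ready-handler defined in a hypothetical src/rootmail.ts, this yields src/rootmail.ready-handler.ts, which is the kind of flat layout described above.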

However, when trying to deploy, I encountered the next error (see “Cannot find Package” errors when I run Lambda code in Node.js?):

{
    "errorType": "Runtime.ImportModuleError",
    "errorMessage": "Error: Cannot find module 'aws-sdk'\nRequire stack:\n- /var/task/index.js\n- /var/runtime/index.mjs",
    "stack": [
        "Runtime.ImportModuleError: Error: Cannot find module 'aws-sdk'",
        "Require stack:",
        "- /var/task/index.js",
        "- /var/runtime/index.mjs",
        "    at _loadUserApp (file:///var/runtime/index.mjs:997:17)",
        "    at async UserFunction.js.module.exports.load (file:///var/runtime/index.mjs:1029:21)",
        "    at async start (file:///var/runtime/index.mjs:1192:23)",
        "    at async file:///var/runtime/index.mjs:1198:1"
    ]
}

After some research, an AWS re:Post article gave me more clarity, stating:

For Node.js runtimes 16 and earlier, Lambda doesn’t support layered JavaScript ES module dependencies. You must include the dependencies in the deployment. Lambda supports JavaScript ES module dependencies for Node.js 18.

So, I switched to the Node.js 18 runtime for all Lambda functions:

const rootMailReady = new NodejsFunction(this, 'ready-handler', {
  // 👇 solved the layer issue with the 'aws-sdk' dependency
  runtime: lambda.Runtime.NODEJS_18_X,
  environment: {
    DOMAIN: domain,
    SUB_DOMAIN: subdomain,
  },
});

Now it finally worked, and I could verify it by sending mails from my Gmail address.

However the next step will be to have it tested with the cdk integ tests.

Adding cdk integ tests

As mentioned before, I wanted proper testing and to get away from the local awscdk-rootmail-test project. All the tests are based on the previously existing rootmail_test.py file from the superwerker project, so I started mapping them to TypeScript.

NOTE: the @aws-cdk/integ-tests-alpha package is still in alpha state, so I expected certain things not to work. However, I was able to get it working and I am happy to share my findings with you.

Baseline

My challenges were:

  1. No auto-completion for the awsApiCall method. Going to the aws-sdk-js docs was the only way to find out what the parameters are, as I don’t know them all by memory 😅 Ok, this was basic RTFM of the sdk docs.
const getHostedZoneParametersAssertion = integ.assertions
  /**
  * Check that parameter is present
  */
  .awsApiCall('SSM', 'getParameter', {
    Name: rootmail.hostedZoneParameterName,
  })
  .expect(
    ExpectedResult.objectLike({
      Parameter: {
        Name: rootmail.hostedZoneParameterName,
        Type: 'StringList',
      },
    }),
  );
  2. I could not implement all the cases with the awsApiCall method, especially those with multiple aws-sdk calls that pass data from one to another. So I needed a more flexible option: how to deploy a Lambda function and invoke it in tests? I summarized it in an issue in the great cdk-integ-tests-sample GitHub repository, and also looked into the official aws-cdk test suites, e.g. the one for api-gateway. The following is a snippet from the integ.rootmail.ts test file. You find the whole code in the linked project at the end of the post.
const closeOpsItemHandler = new NodejsFunction(stackUnderTest, 'close-opsitem-handler', {
  entry: path.join(__dirname, 'functions', 'close-opsitem-handler.ts'),
  runtime: lambda.Runtime.NODEJS_18_X,
  logRetention: 1,
  timeout: Duration.seconds(30),
  initialPolicy: [
    //  policies 👇 go here
    new iam.PolicyStatement({
      actions: [
        'ssm:GetOpsSummary',
        'ssm:UpdateOpsItem',
      ],
      resources: ['*'],
    }),
  ],
});
// ...
const updateOpsItemAssertion = integ.assertions
  .invokeFunction({
    functionName: closeOpsItemHandler.functionName,
    // to be able to 👇 debug
    logType: LogType.TAIL,
    // to run it synchronously  ----- 👇
    invocationType: InvocationType.REQUEST_RESPONE,
    // found this 👇 in the aws-cdk test suite for api-gateway
    payload: JSON.stringify({
      title: id,
    }),
  }).expect(ExpectedResult.objectLike(
    // as the object 'return { closeStatusCode: 200 };'
    // is wrapped in a Payload object with other properties 🙃
    {
      Payload: {
        closeStatusCode: 200,
      },
    },
  ),
  );
  3. The email sending assertion to kick the whole test off was tricky, as I received the following error message:

Received response status [FAILED] from custom resource. Message returned: Email address is not verified. The following identities failed the check in region EU-CENTRAL-1: test@aws-test.mavogel.xyz, root+test-id-1@aws-test.mavogel.xyz

I saw two possible solutions:

  1. either get SES out of sandbox into production mode (in the testing aws account, as receiving works)
  2. or add a verified email address OR domain (see stackoverflow) which was the case for eu-west-1

In the initial solution design, the SES email receiver is in eu-west-1, and the error message shows that the verification check ran in eu-central-1, so I fixed this in the assertion function by initializing the sdk accordingly:

const SES = new AWS.SES({ region: 'eu-west-1' });

export const handler = async (event: any) => {
  // ...
}
  4. The next challenge was how to autowire the DNS setup for testing. I found the following constraint for least-privilege IAM policies for Route53 (excerpt from ChatGPT):

AWS IAM does not support granular permissions down to the resource record set level within Route53. This means you cannot restrict the ChangeResourceRecordSets permission to only apply to specific record sets (like NS records or certain domain names).

IAM policies can limit permissions to specific hosted zones via the resource ARN, but they cannot get more specific than that within Route53. This is primarily due to the fact that DNS record sets (like NS, A, AAAA, CNAME, etc.) aren’t individual resources with their own ARNs.

If you want to limit the impact of a given IAM role that can change DNS records, you might have to take an application-side approach, such as implementing the checks in the application code itself, in the AWS Lambda function in this case, to prevent changes to other record types or domains.

Ok, the Lambda function was straightforward. However, I encountered the following issue: although the SES receipt rule to S3 and the Lambda function were wired correctly, the email was neither delivered to the S3 bucket nor was the Lambda function invoked. I deployed a separate Lambda function manually and it worked as expected, so I was confused.

Looking at the S3 bucket for the mails during my manual testing, I realized that there were 1m30s between the SES_SETUP_NOTIFICATION_MAIL and the first mail that was actually processed. So I added a configurable sleep (because we do not want this delay in the unit tests) with a default of 2 minutes. And voilà, it worked. So it was a race condition in the SES receipt rule setup. It took me 1 day to find out 😅 However, I learned a lot, especially about debugging.
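Coming back to the application-side approach from the excerpt above: the idea is that the Lambda function itself refuses any change that is not an NS record for the expected subdomain. The following is a minimal sketch of such a guard, not the code from the repository. The function name and the reduced RecordChange type are assumptions; the field names follow the Route53 ChangeResourceRecordSets request.

```typescript
// Reduced shape of a Route53 change; only the fields needed for this check.
interface RecordChange {
  Action: 'CREATE' | 'UPSERT' | 'DELETE';
  ResourceRecordSet: { Name: string; Type: string };
}

// Hypothetical guard: accept only NS records for exactly the expected name.
function assertOnlyNsChangesFor(fqdn: string, changes: RecordChange[]): void {
  // compare names case-insensitively and ignore a trailing dot
  const normalize = (n: string) => n.toLowerCase().replace(/\.$/, '');
  for (const change of changes) {
    const { Name, Type } = change.ResourceRecordSet;
    if (Type !== 'NS' || normalize(Name) !== normalize(fqdn)) {
      throw new Error(`refusing to change ${Type} record ${Name}`);
    }
  }
}
```

Calling such a guard before sending the change batch keeps the blast radius small, even though the IAM policy itself is only scoped to the hosted zone.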

More obstacles

There was a race condition in the integ test: I realized that the cleanupAssertion function was called at the end of the sequence, but both when the stack was created AND when it was updated. This meant that resources were cleaned up in the middle of the test run. I understood this once I realized how the integ-runner works.

// check the parameter store
getHostedZoneParametersAssertion
  // Send a test email
  .next(sendTestEmailAssertion)
  // Validate an OPS item was created.
  .next(validateOpsItemAssertion) // <- 
  // Close the OPS item that was created.
  .next(updateOpsItemAssertion) // <-
  // call teardown 👇 lambda
  .next(cleanupAssertion);

For now I implemented the logic so that all custom resources clean up after themselves. For the S3 bucket containing the mails, I have a separate Python script which needs to be run manually. As a new bucket with a random suffix is created per test run, it is fine for now to clean it up manually as follows:

import boto3
import sys

def empty_and_delete_bucket(bucket_name):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)

    # Empty the bucket
    print(f"Emptying bucket: {bucket_name}")
    for obj_version in bucket.object_versions.all():
        obj_version.delete()

    # Delete the bucket
    bucket.delete()
    print(f"Bucket {bucket_name} deleted")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Please provide the bucket name as a parameter.")
        sys.exit(1)
    
    bucket_name = sys.argv[1]
    empty_and_delete_bucket(bucket_name)

As of now, I could not use invokeFunction().waitForAssertions() to poll a Lambda function, as the following error shows:

2023-09-07T16:12:00.039Z	92341375-fb62-4589-82e8-b0802ea4102c	INFO	AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/SetupTestDefaultTestDeplo-SingletonFunction76b3e83-MF49XEZ4HA0J/SetupTestDefaultTestDeplo-SingletonFunction76b3e83-6VsSOMHQs31T is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:eu-west-1:123456789012:function:RootmailTestStack-closeopsitemhandler2F03D32C-U06t2LsB3GQR because no identity-based policy allows the lambda:InvokeFunction action

so I had to implement it by myself. However I am not sure if this is the best way to do it, but it works for now!
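The hand-rolled polling is not shown in this post. As a rough sketch of the idea only (not the code from the repository), a generic helper could retry an asynchronous call until its result is accepted. The name pollUntil, its parameters and the defaults are all hypothetical; in the test, the attempt would be the Lambda invocation.

```typescript
// Hypothetical polling helper: calls `attempt` until `isDone` accepts its
// result or the attempts are used up. Names and defaults are illustrative.
async function pollUntil<T>(
  attempt: () => Promise<T>,
  isDone: (result: T) => boolean,
  maxAttempts = 10,
  delayMs = 1000,
): Promise<T> {
  for (let i = 1; i <= maxAttempts; i++) {
    const result = await attempt();
    if (isDone(result)) {
      return result;
    }
    if (i < maxAttempts) {
      // wait before the next try
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`condition not met after ${maxAttempts} attempts`);
}
```

The helper rejects once the attempts are exhausted, so a test using it fails loudly instead of hanging.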

Finally

After all the testing was done, I added documentation and approached Thorsten Höger from taimos for a cdk-app-review to learn from the best, on what I could improve. For me, he is one of the best cdk developers I know and he is also a great and patient teacher. He gave me a lot of valuable feedback, which I will incorporate in the next steps.

The cdk-app-review

First things first: I decided to learn from the experts, and I highly recommend getting your cdk project reviewed by Thorsten as well.

Before the review

I used to utilize PhysicalName.GENERATE_IF_NEEDED for certain resource naming conventions, which was associated with issues such as those discussed here.

Parameters were employed only as indexed objects. For further insights on how we managed TypeScript Maps, you may refer to this blog post.

Originally, I used integrated stacksets, transitioning from this discussion to utilizing cdk-stacksets. A notable advantage of using CDK native features is that they inherently possess multi-account and region capabilities, similar to Terraform.

One of the main pieces of feedback was that the code in the rootmail.ts file was too complex. However, I wanted to stick to the original implementation first and refactor it afterwards, so I did not change it before the review.

  1. the rootMailReady function checks for a maximum of 260s (4m20s) whether the SES setup is done / the DNS is wired
const rootMailReady = new NodejsFunction(this, 'ready-handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  // the timeout effectively limits retries to 2^(n+1) - 1 = 9 attempts with backoff
  //  as the function is called every 5 minutes from the event rule
  timeout: Duration.seconds(260),
  logRetention: 3,
  environment: {
    DOMAIN: domain,
    SUB_DOMAIN: subdomain,
  },
});

// more code

this.rootMailReadyEventRule = new events.Rule(this, 'RootMailReadyEventRule', {
  schedule: events.Schedule.rate(Duration.minutes(5)),
});
  2. if it does not run into the timeout, the CloudWatch alarm goes to green (OK)
const rootMailReadyAlert = new cw.Alarm(this, 'Errors', {
  alarmName: 'superwerker-RootMailReady',
  comparisonOperator: cw.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  metric: new cw.Metric({
    namespace: 'AWS/Lambda',
    metricName: 'Errors',
    period: Duration.seconds(180),
    statistic: 'Sum',
    dimensionsMap: {
      // see the function name 👇
      FunctionName: rootMailReady.functionName,
    },
  }),
  evaluationPeriods: 1,
  threshold: 1,
});
  3. which then triggers the rootMailReadyTrigger function
const rootMailReadyTriggerEventPattern = new events.Rule(this, 'RootMailReadyTriggerEventPattern', {
  eventPattern: {
    detailType: ['CloudWatch Alarm State Change'],
    source: ['aws.cloudwatch'],
    detail: {
      alarmName: [rootMailReadyAlert.alarmName],
      state: {
        value: ['OK'],
      },
    },
  },
});

rootMailReadyTriggerEventPattern.addTarget(new LambdaFunction(rootMailReadyTrigger));
  4. which then triggers the rootMailReadyHandle wait condition for the stack
const rootMailReadyHandle = new CfnWaitConditionHandle(this, 'RootMailReadyHandle');

new CfnWaitCondition(this, 'RootMailReadyHandleWaitCondition', {
  handle: rootMailReadyHandle.ref,
  timeout: totalTimeToWireDNS.toSeconds().toString(),
});

const rootMailReadyTrigger = new NodejsFunction(this, 'ready-trigger-handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  timeout: Duration.seconds(10),
  logRetention: 3,
  environment: {
    // HTTP POST URL to trigger 👇 the wait condition
    SIGNAL_URL: rootMailReadyHandle.ref,
    ROOTMAIL_READY_EVENTRULE_NAME: this.rootMailReadyEventRule.ruleName,
    AUTOWIRE_DNS_EVENTRULE_NAME: autowireDNSEventRuleName,
  },
});
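The trigger function's job is essentially to send a small JSON document to the pre-signed URL behind the wait condition handle. Here is a minimal sketch of that payload, assuming the signal format from the CloudFormation documentation (Status, Reason, UniqueId, Data). The helper name and the Reason/Data texts are made up for illustration, and the actual handler is not shown here.

```typescript
// Shape of the JSON document CloudFormation expects on a wait condition handle URL.
interface WaitConditionSignal {
  Status: 'SUCCESS' | 'FAILURE';
  Reason: string;
  UniqueId: string;
  Data: string;
}

// Hypothetical helper: builds the body that would be sent to SIGNAL_URL.
function buildReadySignal(uniqueId: string): string {
  const signal: WaitConditionSignal = {
    Status: 'SUCCESS',
    Reason: 'RootMail setup completed', // illustrative text
    UniqueId: uniqueId,
    Data: 'RootMail setup completed', // illustrative text
  };
  return JSON.stringify(signal);
}
```

Once CloudFormation receives a SUCCESS signal on the handle, the wait condition completes and the stack deployment continues.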

I have no clue why it was built like this back then, but it works 😅 and I will refactor it in the next steps. Rebuilding it in cdk was a slight pain, but I learned a lot about the cdk and AWS ecosystem, which was fine.

After the review

I opted to generate S3 bucket names as everything is now consolidated into a single stack, enhancing manageability and consistency.

The utilization of grants was adopted because it seamlessly manages permissions, such as automatically handling kms key permissions when required, thereby streamlining access management.

I shifted to using the ?? (nullish coalescing) operator instead of || to ensure that only null or undefined values trigger the use of a default value, thus making our conditionals more robust and accurate.
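A small sketch of the difference, using made-up props (logRetentionDays is not an option of the construct, it only serves as an example of a value where 0 is meaningful):

```typescript
// Illustrative props; `logRetentionDays` is a made-up option.
interface ExampleProps {
  subdomain?: string;
  logRetentionDays?: number;
}

// `||` falls back on every falsy value, including '' and 0.
function resolveWithOr(props: ExampleProps) {
  return {
    subdomain: props.subdomain || 'aws',
    logRetentionDays: props.logRetentionDays || 7,
  };
}

// `??` falls back only on null and undefined.
function resolveWithNullish(props: ExampleProps) {
  return {
    subdomain: props.subdomain ?? 'aws',
    logRetentionDays: props.logRetentionDays ?? 7,
  };
}
```

For the input { subdomain: '', logRetentionDays: 0 }, the || variant silently replaces both values with the defaults, while the ?? variant keeps the empty string and the 0 that the caller passed in.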

Instead of manually passing the Route53 hosted zone ID as a parameter, I now retrieve it via a lookup, which enhances automation and reduces the potential for human error such as typos.

In order to achieve a more streamlined logic and to harness AWS CDK’s capabilities more effectively, I adopted isCompleteHandlers in AWS CDK custom resources. This was the biggest change in the codebase, as it required a complete rewrite of the custom resource logic. The following code snippet illustrates the implementation of isCompleteHandlers:

const route53 = new Route53();
const ssm = new SSM();

export interface IsCompleteHandlerResponse {
  IsComplete: boolean;
}

export async function handler(event: AWSCDKAsyncCustomResource.OnEventRequest): Promise<IsCompleteHandlerResponse> {
  const hostedZoneParameterName = event.ResourceProperties[PROP_R53_HANGEINFO_ID_PARAMETER_NAME];

  const recordSetCreationResponseChangeInfoIdParam = await ssm.getParameter({
    Name: hostedZoneParameterName,
  }).promise();
  const recordSetCreationResponseChangeInfoId = recordSetCreationResponseChangeInfoIdParam.Parameter?.Value as string;

  log(`got R53 change info id: ${recordSetCreationResponseChangeInfoId} for event type ${event.RequestType}`);
  log({
    msg: 'event',
    event,
  });

  switch (event.RequestType) {
    case 'Create':
      log('waiting for DNS to propagate');
      try {
        // we use the waiter, however with a small delay and only 1 attempt
        // as the polling logic is handled by the CR itself
        const res = await route53.waitFor('resourceRecordSetsChanged', {
          Id: recordSetCreationResponseChangeInfoId,
          // Note: the default is 30s delay and 60 attempts
          $waiter: {
            delay: 2,
            maxAttempts: 1,
          },
        }).promise();

        if (res.ChangeInfo.Status !== 'INSYNC') {
          log(`DNS propagation not in sync yet. Has status ${res.ChangeInfo.Status}`);
          return { IsComplete: false };
        }

        log(`DNS propagated with status ${res.ChangeInfo.Status}`);
        return { IsComplete: true };
      } catch (e) {
        log(`DNS propagation errored. Has message ${e}`);
        return { IsComplete: false };
      }
    case 'Update':
    case 'Delete':
      return {
        IsComplete: true,
      };
  }
}

function log(msg: any) {
  console.log(JSON.stringify(msg));
}

This handler is then plugged into the provider and called iteratively. This simplified the whole polling, which was previously wrapped in complex polling-with-backoff logic, called by an EventBridge event every 5 minutes.

Now it is all handled by the isCompleteHandlers and onEventHandlers of the Provider class. The following code snippet illustrates the implementation:

this.provider = new cr.Provider(this, 'rootmail-autowire-dns-provider', {
  isCompleteHandler: isCompleteHandlerFunc,
  queryInterval: Duration.seconds(5),
  totalTimeout: Duration.minutes(20),
  onEventHandler: onEventHandlerFunc,
});
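To illustrate how queryInterval and totalTimeout interact, here is a simplified, synchronous simulation of such a polling loop. It is not the provider framework's real implementation (which runs in AWS and actually waits between calls); the function name and the PollOutcome type are assumptions made for this sketch.

```typescript
// Simplified simulation of isComplete polling with a fake clock.
interface PollOutcome {
  completed: boolean;
  attempts: number;
  elapsedSeconds: number;
}

function simulateIsCompletePolling(
  isComplete: (attempt: number) => boolean,
  queryIntervalSeconds: number,
  totalTimeoutSeconds: number,
): PollOutcome {
  let attempts = 0;
  let elapsedSeconds = 0;
  while (elapsedSeconds <= totalTimeoutSeconds) {
    attempts++;
    if (isComplete(attempts)) {
      return { completed: true, attempts, elapsedSeconds };
    }
    elapsedSeconds += queryIntervalSeconds; // pretend we waited one interval
  }
  return { completed: false, attempts, elapsedSeconds };
}
```

With the values from the provider above (5 seconds interval, 20 minutes timeout) and a DNS change that is in sync on the fourth check, the simulation reports completion after 15 simulated seconds. If the check never succeeds, it gives up once the timeout is exceeded.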

I also adopted cdk-nag to ensure that the code adheres to AWS best practices and security guidelines. This was achieved by adding cdk-nag as a dev dependency and applying its rule packs to the stack as aspects, so that the checks run whenever the app is synthesized.

Conclusion

Diving into AWS, my journey from Terraform to gaining proficiency in CDK has been a notable learning curve, enriched by first-time development experiences with ChatGPT and GitHub Copilot. The integration tests in particular pushed me to delve deep into design and effective testing methodologies, despite several hurdles with cdk-nag suppressions. A nod to Superluminar for laying the groundwork on Custom Resources (CR) in TypeScript - your efforts have helped this project significantly.

I highly recommend the cdk-app-review from Thorsten Höger - it provides substantial insights and could be a game-changer for your CDK projects. I also recommend the ChatGPT workshops from Cristian Măgherușan-Stanciu, which helped me a lot to use the right prompts for the chatbot. And not to forget Matthew Bonig’s advanced cdk course.

Still on the to-do list:

  • Implementing it in GitHub Actions, following this guide. 🛑
  • Migrating to aws-sdk-js-v3
  • Figuring out a method to run the test stack exclusively for faster feedback and not using Lambda log debugging. ✅
  • Fixing the cdk-nag errors and warnings and/or adding appropriate suppressions. ✅

This endeavor was not just a technical deep dive, but also a glimpse into the continually evolving AWS landscape, emphasizing the imperative of continuous learning and adaptation in cloud computing and development.
