Microservice Debugging With Lambdas: A Detective Story

Sep 20, 2021 | By Calvin Ton

Here at FloQast, we like to use AWS Lambdas. It's great for scaling modular development and comes with a lot of great tooling under the AWS ecosystem including layers for shared code dependencies, CloudWatch for monitoring, or X-Ray for distributed tracing (see this post or this one for more info). One thing to note about setting up Lambdas is configuring timeouts.

Situation

The diagram below lays out a situation similar to one that we came across using Lambdas.

We have a queueing system that maps queue message insertions to invocations of corresponding Lambdas. As a message gets added to the queue, it gets picked up and its payload is passed into the Lambda using AWS SDK. Once a message has been picked up for processing, a "visibility window" is set on the message. In the case this message needs to be retried, it will not be until this window has passed. More information regarding this queueing architecture can be found in this previous post.

This queue is configured for a 6.5-minute visibility, the same as the timeout set on Lambda 1. This is equal to the max total time of any given job, aggregating the max time required for each downstream Lambda. Each Lambda is configured with its own corresponding timeout for the estimated max duration of its workload. This is except for Lambda 4 (which takes about a minute) with a max 15-minute timeout because why not?.. but more on this later.

The Lambda 1 instance of the SDK is set for a 3-minute timeout, covering its direct invocations of Lambdas 2, 3, and 5. There are logs in place to signal when each Lambda starts and succeeds/fails. We expect these logs to appear in the order of invocation with all downstream Lambdas logging their completion before their upstream Lambda (upstream has to wait for downstream to finish executing).

This system was tested by adding messages into the queue and watching as messages get pulled and processed. Things were going smoothly as expected until we saw a sudden spike in concurrent executions of each Lambda. Considering the max concurrent jobs to be processed by the queue is set to 25, one would expect any given Lambda in this system would have at most 25 concurrent executions. Some Lambdas, however, were nearing 80 concurrent executions.

A Visit To The Good Ol' Logs

We started by tracing logs for jobs that were in a bad state to see if things were executed in the expected order of operations.

09:00:00 Lambda 1 started for Company A...
09:00:03 Lambda 2 started for Company A...
09:00:37 Lambda 2 completed for Company A...
09:00:48 Lambda 3 started for Company A...
09:00:52 Lambda 4 started for Company A...

Looks like the message was successfully picked up from the queue and Lambda 1 gets kicked off. Nice! Then the first downstream Lambdas are being invoked in the order as expected... Looks good. Looks good. Let's keep goin'.

09:00:48 Lambda 3 started for Company A...
09:00:52 Lambda 4 started for Company A...
09:01:46 Lambda 4 completed for Company A...
09:01:51 Lambda 3 completed for Company A...
09:01:55 Lambda 5 started for Company A...
09:02:00 Lambda 1 started for Company A...
09:02:04 Lambda 2 started for Company A...

Looks like the next downstream Lambdas get invoked. Looks good.. Then the job gets kicked off by Lambda 1 again.. Looks.. wait, what?

The visibility window was correctly flagged on the message and if it was picked up at 09:00:00, the earliest subsequent retry shouldn't start until at least 09:06:30. Furthermore, the max retries for messages on this queue are set explicitly to 0. The message should have never been retried to begin with. Upon further inspection of the logs, there were several instances where the same process was kicked off again by Lambda 1 exactly every 2 minutes up to 3 times.

09:00:00 Lambda 1 started for Company A...
09:00:03 Lambda 2 started for Company A...
...
09:01:55 Lambda 5 started for Company A...
09:02:00 Lambda 1 started for Company A...
...
09:04:00 Lambda 1 started for Company A...
...
09:06:00 Lambda 1 started for Company A...

We spent time tracing through logs and queue message records for several different jobs finding similar behavior.

Discovery

The exactness of these 2-minute gaps led us to believe this issue wasn't due to the queue message being pulled again. Rather, the initial SDK invocation of Lambda 1 in the job used the default timeout of 2 minutes with 3 automatic retries, ignoring the timeout set directly on Lambda 1 and the max retries set on the job.

The downstream Lambdas were still running while Lambda 1 "times out" after 2 minutes and automatically retries through the SDK, restarts processing the same message, calling the same downstream Lambdas again. As such, we sometimes ended up with 4x the amount of Lambda invocations as expected. This just required a minor configuration update of the SDK.

// maxRetries: retry count / timeout: socket timeout / connectTimeout: new connection timeout

var AWS = require('aws-sdk');

AWS.config.update({
    maxRetries: 0,
    httpOptions: {
        timeout: 390000,
        connectTimeout: 5000
    }
});

Notes To Leave With

A Lambda's timeout must be set to at least the sum of all downstream Lambda timeouts (with some cushion for cold starts etc.). In this case, fixing the SDK invocation of Lambda 1 will still leave this system susceptible to Lambda 4's execution time exceeding that of Lambda 3.

Even if a Lambda is configured with the correct timeout, the invocation of the Lambda also needs to account for that timeout. This can be configured on the SDK. However, if a Lambda is invoked via an HTTP request to API Gateway, the default time limit is 30 seconds. This limit is not configurable at the moment.

All in all, there are tons of great use cases for Lambdas, but they also come with their limitations and configuration challenges. Here's a great resource outlining some of these pros/cons. Definitely take these into consideration when evaluating if they fit your use case. With that, I wish you, "Good Luck" and "Happy Scaling!"

Calvin Ton
Software Engineer at FloQast. Musichead and avid cratedigger.

Check out research, videos, case studies, and more!

Learn more about working at FloQast!