
Send errors metrics for 5xx response from API Gateway, Lambda Function URL, or ALB #229


Merged

Conversation

kimi-p
Contributor

@kimi-p kimi-p commented Jun 21, 2022

What does this PR do?

  • Send aws.lambda.enhanced.errors when the Lambda invocation comes from AWS API Gateway, AWS Lambda Function URL, or ALB and returns a 5xx response.
  • Set span.error=1.
  • Reset self.response to None in _before so that the response from the last invocation doesn't carry over into the current one (a minimal sketch of the combined behavior follows this list).
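
A minimal, self-contained sketch of the combined behavior (simplified; names like submit_errors_metric mirror the library's helpers but this is not the actual implementation):

def submit_errors_metric(context):
    # Stand-in for the real helper that emits aws.lambda.enhanced.errors.
    print("aws.lambda.enhanced.errors +1 for %s" % context)

class WrapperSketch(object):
    def __init__(self):
        self.response = None
        self.trigger_tags = {}
        self.span = None

    def _before(self, event, context):
        # Reset per-invocation state so a reused container doesn't leak the
        # previous invocation's response into the current one.
        self.response = None

    def _after(self, event, context):
        status_code = str((self.response or {}).get("statusCode", "")) or None
        if status_code:
            self.trigger_tags["http.status_code"] = status_code
            # Treat any 5xx response as an error: emit the enhanced metric
            # and mark the function execution span.
            if len(status_code) == 3 and status_code.startswith("5"):
                submit_errors_metric(context)
                if self.span:
                    self.span.error = 1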

Motivation

When a request fails inside a web framework, the framework typically returns a 5xx response to the client instead of raising an exception. From the user's perspective, invocations returning a 5xx response (i.e., the value of the statusCode field in the response is 5xx) should be reported as errors in Datadog.
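
For example (a hypothetical handler using the API Gateway proxy response format), the exception is swallowed inside the handler and only the statusCode field signals the failure:

def handler(event, context):
    try:
        raise RuntimeError("database unavailable")
    except RuntimeError:
        # No exception escapes the handler, so the invocation itself succeeds;
        # only the statusCode tells us the request actually failed.
        return {"statusCode": 502, "body": "Bad Gateway"}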

Testing Guidelines

[screenshots]

Additional Notes

To avoid sending a duplicate aws.lambda.enhanced.errors metric, self.already_submitted_errors_metric_before is added.

Types of Changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)

@@ -190,6 +192,9 @@ def _after(self, event, context):
status_code = extract_http_status_code_tag(self.trigger_tags, self.response)
if status_code:
    self.trigger_tags["http.status_code"] = status_code
    if not self.already_submitted_errors_metric_before:
Collaborator

I think a better way to solve this problem is to reset self.response to None in _before (or in _call before the try block). This way if the invocation fails, self.response would still be None and you won't have a status_code.

Collaborator

BTW, in your current implementation already_submitted_errors_metric_before never resets to False once set to True; __init__ only gets called once, on cold start. You need to reset it in _before (or in _call before the try block).

Contributor Author

Nice catch. self.response is already initialized to None here:

self.response = None

And when the exception happens, self.response will still be None. So we actually do not need the flag (self.already_submitted_errors_metric_before) to avoid sending the errors metric twice, since the statusCode will be None.

Collaborator

Yes, it's initialized to None, but it doesn't get reset to None between invocations. E.g., you have a good invocation, which sets self.response, and then a failed invocation; self.response won't reset to None, so extract_http_status_code_tag would extract the status code from the last good invocation instead of the current failed one.

Even putting your use case aside, self.response should probably be reset to None before every invocation; keeping the response value from the previous invocation around in the current invocation is meaningless and can only lead to bugs.
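
To illustrate the carry-over (a simplified, hypothetical wrapper, not the library code): the decorator instance is created once per container and keeps its attributes across warm invocations.

class WrapperSketch(object):
    def __init__(self):
        self.response = None  # runs once, on cold start

    def __call__(self, func, event, context):
        # Without "self.response = None" here, a failing invocation leaves the
        # previous invocation's response in place for the status-code check.
        self.response = func(event, context)
        return self.response

wrapper = WrapperSketch()  # module-level singleton, reused across warm invocations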

Collaborator

One piece of context you may have missed: Lambda isn't truly stateless; a sandbox/container is reused for many invocations until it gets recycled by AWS. https://pfisterer.dev/posts/aws-lambda-container-reuse might be a good read?

Contributor Author

Thanks for sharing the post! Just want to be sure (since the post only talks about the shared folder -- "It's possible that a container is reused, which causes the /tmp folder to be shared between invocations."): does the singleton wrapper get reused between multiple invocations?
datadog_lambda_wrapper = _LambdaDecorator

I plan to reset self.response at the beginning of _before() anyway. Thanks for calling this out!

Collaborator

Yes. A new container starts up on the first invocation and your Python app gets loaded into memory; that's called a cold start. From there, the container will be reused for subsequent invocations until AWS recycles it, or until AWS needs to launch more containers to handle more requests (one Lambda container can only handle one request/invocation at a time).
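
A rough illustration of that lifecycle (hypothetical handler, not library code): module-level code runs once per container, and module state persists across warm invocations until AWS recycles the container.

import time

_container_started_at = time.time()  # evaluated once, on cold start
_invocation_count = 0                # persists across warm invocations

def handler(event, context):
    global _invocation_count
    _invocation_count += 1
    return {
        "coldStart": _invocation_count == 1,
        "invocationsInThisContainer": _invocation_count,
        "containerAgeSeconds": round(time.time() - _container_started_at, 1),
    }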

@@ -190,6 +190,10 @@ def _after(self, event, context):
status_code = extract_http_status_code_tag(self.trigger_tags, self.response)
Collaborator

Replied to the original conversation, but want to point out: if the invocation fails, self.response would still hold the value from the previous good invocation, and you would end up emitting an error metric based on the previous invocation instead of the current one.

if len(status_code) == 3 and status_code.startswith("5"):
    submit_errors_metric(context)
    if self.span:
        self.span.set_traceback()
Collaborator

I'm not sure you have a valid stacktrace in this case, since there isn't a real Python exception being thrown. I suspect you need to directly set the error message and type fields: https://github.com/DataDog/dd-trace-py/blob/fb8dfa2f33fff37d21df9728d8386c0260df9744/ddtrace/contrib/grpc/server_interceptor.py#L41-L42

We should set both fields to something meaningful that helps users understand the problem when they see it. For error.type we can use StatusCode 5xx; for error.msg we can probably say Lambda invocation returns status code 500 (we can probably spell out the actual status code in the message instead of 5xx).

@brettlangdon we are trying to mark Lambda invocations returning statusCode 5xx as errors in trace, is what I mentioned above the best way to do it?

Member

This is what we do in the tracer: https://github.com/DataDog/dd-trace-py/blob/3b91b0da8/ddtrace/contrib/trace_utils.py#L274-L282

tl;dr: you just need to do self.span.error = 1
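
A small sketch of that approach as it might apply here (assuming span is a ddtrace span and status_code is the string extracted from the response earlier in _after): no traceback, just span.error plus the status-code tag.

def mark_5xx_as_error(span, status_code):
    # Mirrors the idea in ddtrace's trace_utils: the UI treats the span as an
    # error based on span.error and the http.status_code tag alone.
    if span is None or not status_code:
        return
    span.set_tag("http.status_code", status_code)
    if len(status_code) == 3 and status_code.startswith("5"):
        span.error = 1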

@@ -150,6 +153,7 @@ def __call__(self, event, context, **kwargs):
self._after(event, context)

def _before(self, event, context):
self.response = None
Collaborator

Let's put this line inside the try: ... block, so we avoid Datadog code crashing the customer application as much as we can. That is, Datadog should fail quietly (only log something) with no interruption to the customer application. I know this line is pretty safe, but it may "invite" future developers to add more lines outside the try block because they saw some lines outside...
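
A sketch of that structure (simplified; the logger and the elided setup steps are placeholders): everything in _before, including the reset, sits inside the try so a Datadog-side failure only gets logged.

import logging

logger = logging.getLogger(__name__)

class WrapperSketch(object):
    def _before(self, event, context):
        try:
            self.response = None
            # ... extract trigger tags, start spans, submit the invocation metric ...
        except Exception:
            logger.exception("datadog: _before failed, continuing with customer handler")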

self.span.set_traceback()
self.span.error = 1
self.span.set_tags({
    ERROR_TYPE: "5xx Server Errors",
Member

This is typically left for the Python class of the error.

e.g.

try:
    raise ValueError("oh no")
except Exception as e:
    error_type = type(e).__name__  # "ValueError"

Similar for ERROR_MSG, it is meant to be the traceback.

It might be better to avoid these just because the backend/UI is expecting specific values.

Do we also tag this span with the http.status_code?

Contributor Author

@kimi-p kimi-p Jun 23, 2022

Hey Brett,

I am new to the team so I might not be 100% correct, but I think we want to mark 5xx responses as "Error" even though the program does not throw any exception. I think the intended UI result looks like the first screenshot. So that's why I'm setting the error.type and error.msg here (according to the 2nd picture, which is from the Datadog doc).

[two screenshots: the intended UI result and the Datadog doc referenced above]

Do we also tag this span with the http.status_code?

Yes, we do here right before the logic I added:

self.trigger_tags["http.status_code"] = status_code

Member

Typically when we mark a span as an error because of the status code we don't set a traceback (i.e., we don't call span.set_traceback()), because there isn't an exception/traceback associated with the "failure".

It is OK to only set span.error = 1; as long as we also have the http.status_code tag set on the span (thanks for clarifying, I couldn't tell whether trigger_tags get added to the span or not), the UI knows what to do with it.

But this is probably more of a product decision: do we want to add a traceback for these specific 500 responses? Does it provide value for customers?

Member

(also 👋🏻 welcome!!)

Contributor Author

Regarding trigger_tags, I can't be 100% sure, but I think we do set it here:

if self.make_inferred_span:
    self.inferred_span = create_inferred_span(event, context)
self.span = create_function_execution_span(
    context,
    self.function_name,
    is_cold_start(),
    trace_context_source,
    self.merge_xray_traces,
    self.trigger_tags,

I will check with the team about the traceback. Thanks for the comment.

Contributor Author

Update: we're now only setting self.span.error = 1 and not setting error.type and error.msg. The frontend will handle it the same way it does for APM. The PR description has been updated to reflect the change, and the screenshots are also updated.

@kimi-p kimi-p marked this pull request as ready for review June 24, 2022 15:03
@kimi-p kimi-p requested a review from a team as a code owner June 24, 2022 15:03
@kimi-p kimi-p changed the title Send errors metrics for 5xx reponse from API Gateway, Lambda Function URL, or ALB Send errors metrics for 5xx response from API Gateway, Lambda Function URL, or ALB Jun 24, 2022
@@ -42,3 +42,13 @@ class XrayDaemon(object):
    XRAY_TRACE_ID_HEADER_NAME = "_X_AMZN_TRACE_ID"
    XRAY_DAEMON_ADDRESS = "AWS_XRAY_DAEMON_ADDRESS"
    FUNCTION_NAME_HEADER_NAME = "AWS_LAMBDA_FUNCTION_NAME"


SERVER_ERRORS_STATUS_CODES = {
Collaborator

you probably don't need them any longer.

Collaborator

@tianchu tianchu left a comment

💯

@kimi-p kimi-p merged commit ef9f18c into main Jun 27, 2022
@kimi-p kimi-p deleted the kimi/sls-1775.send_enhanced_errors_metrics_for_5xx_on_python_library branch June 27, 2022 19:32
kimi-p added a commit that referenced this pull request Jun 27, 2022