
Send errors metrics for 5xx response from API Gateway, Lambda Function URL, or ALB #229


Merged

Conversation

kimi-p
Contributor

@kimi-p kimi-p commented Jun 21, 2022

What does this PR do?

  • Send aws.lambda.enhanced.errors when the Lambda invocation comes from AWS API Gateway, AWS Lambda Function URL, or ALB and returns a 5xx response.
  • Set span.error=1.
  • Reset self.response to None in _before so that the response from the last invocation doesn't carry over into the current one (a minimal sketch of the combined behavior follows this list).
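
A minimal, self-contained sketch of the combined behavior (simplified; names like submit_errors_metric mirror the library's helpers but this is not the actual implementation):

def submit_errors_metric(context):
    # Stand-in for the real helper that emits aws.lambda.enhanced.errors.
    print("aws.lambda.enhanced.errors +1 for %s" % context)

class WrapperSketch(object):
    def __init__(self):
        self.response = None
        self.trigger_tags = {}
        self.span = None

    def _before(self, event, context):
        # Reset per-invocation state so a reused container doesn't leak the
        # previous invocation's response into the current one.
        self.response = None

    def _after(self, event, context):
        status_code = str((self.response or {}).get("statusCode", "")) or None
        if status_code:
            self.trigger_tags["http.status_code"] = status_code
            # Treat any 5xx response as an error: emit the enhanced metric
            # and mark the function execution span.
            if len(status_code) == 3 and status_code.startswith("5"):
                submit_errors_metric(context)
                if self.span:
                    self.span.error = 1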

Motivation

When a request fails inside a web framework, the framework typically returns a 5xx response to the client instead of raising an exception. From the user's perspective, invocations returning a 5xx response (i.e., the value of the statusCode field in the response is 5xx) should be reported as errors in Datadog.
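
For example (a hypothetical handler using the API Gateway proxy response format), the exception is swallowed inside the handler and only the statusCode field signals the failure:

def handler(event, context):
    try:
        raise RuntimeError("database unavailable")
    except RuntimeError:
        # No exception escapes the handler, so the invocation itself succeeds;
        # only the statusCode tells us the request actually failed.
        return {"statusCode": 502, "body": "Bad Gateway"}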

Testing Guidelines

[screenshots]

Additional Notes

To avoid sending a duplicate aws.lambda.enhanced.errors metric, self.already_submitted_errors_metric_before is added.

Types of Changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)

@@ -190,6 +192,9 @@ def _after(self, event, context):
status_code = extract_http_status_code_tag(self.trigger_tags, self.response)
if status_code:
    self.trigger_tags["http.status_code"] = status_code
    if not self.already_submitted_errors_metric_before:
Collaborator

I think a better way to solve this problem is to reset self.response to None in _before (or in _call before the try block). This way if the invocation fails, self.response would still be None and you won't have a status_code.

Collaborator

BTW, in your current implementation already_submitted_errors_metric_before never resets to False once set to True; __init__ only gets called once, on cold start. You need to reset it in _before (or in _call before the try block).

Contributor Author

Nice catch. self.response is already initialized to None here:

self.response = None

And when the exception happens, self.response will still be None. So we actually do not need the flag (self.already_submitted_errors_metric_before) to avoid sending the errors metric twice, since the statusCode will be None.

Collaborator

Yes, it's initialized to None, but it doesn't get reset to None between invocations. E.g., you have a good invocation, which sets self.response, and then a failed invocation; self.response won't reset to None, so extract_http_status_code_tag would extract the status code from the last good invocation instead of the current failed one.

Even putting your use case aside, self.response should probably be reset to None before every invocation; keeping the response value from the previous invocation around in the current invocation is meaningless and can only lead to bugs.
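
To illustrate the carry-over (a simplified, hypothetical wrapper, not the library code): the decorator instance is created once per container and keeps its attributes across warm invocations.

class WrapperSketch(object):
    def __init__(self):
        self.response = None  # runs once, on cold start

    def __call__(self, func, event, context):
        # Without "self.response = None" here, a failing invocation leaves the
        # previous invocation's response in place for the status-code check.
        self.response = func(event, context)
        return self.response

wrapper = WrapperSketch()  # module-level singleton, reused across warm invocations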

Collaborator

One piece of context you may have missed: Lambda isn't truly stateless; a sandbox/container is reused for many invocations until it gets recycled by AWS. https://pfisterer.dev/posts/aws-lambda-container-reuse might be a good read?

Contributor Author

Thanks for sharing the post! Just want to be sure (since the post only talks about the shared folder -- "It's possible that a container is reused, which causes the /tmp folder to be shared between invocations."): does the singleton wrapper get reused between multiple invocations?
datadog_lambda_wrapper = _LambdaDecorator

I plan to reset self.response at the beginning of _before() anyway. Thanks for calling this out!

Collaborator

Yes. A new container starts up on the first invocation and your Python app gets loaded into memory; that's called a cold start. From there, the container will be reused for subsequent invocations until AWS recycles it, or until AWS needs to launch more containers to handle more requests (one Lambda container can only handle one request/invocation at a time).
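
A rough illustration of that lifecycle (hypothetical handler, not library code): module-level code runs once per container, and module state persists across warm invocations until AWS recycles the container.

import time

_container_started_at = time.time()  # evaluated once, on cold start
_invocation_count = 0                # persists across warm invocations

def handler(event, context):
    global _invocation_count
    _invocation_count += 1
    return {
        "coldStart": _invocation_count == 1,
        "invocationsInThisContainer": _invocation_count,
        "containerAgeSeconds": round(time.time() - _container_started_at, 1),
    }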

@@ -190,6 +190,10 @@ def _after(self, event, context):
status_code = extract_http_status_code_tag(self.trigger_tags, self.response)
Collaborator

Replied to the original conversation, but want to point out: if the invocation fails, self.response would still hold the value from the previous good invocation, and you would end up emitting an error metric based on the previous invocation instead of the current one.

if len(status_code) == 3 and status_code.startswith("5"):
    submit_errors_metric(context)
    if self.span:
        self.span.set_traceback()
Collaborator

I'm not sure you have a valid stacktrace in this case, since there isn't a real Python exception being thrown. I suspect you need to directly set the error message and type fields: https://github.com/DataDog/dd-trace-py/blob/fb8dfa2f33fff37d21df9728d8386c0260df9744/ddtrace/contrib/grpc/server_interceptor.py#L41-L42

We should set both fields to something meaningful that helps users understand the problem when they see it. For error.type we can use StatusCode 5xx; for error.msg we can probably say Lambda invocation returns status code 500 (we can probably spell out the actual status code in the message instead of 5xx).

@brettlangdon we are trying to mark Lambda invocations returning statusCode 5xx as errors in trace, is what I mentioned above the best way to do it?

Member

This is what we do in the tracer: https://github.com/DataDog/dd-trace-py/blob/3b91b0da8/ddtrace/contrib/trace_utils.py#L274-L282

tl;dr: you just need to do self.span.error = 1
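
A small sketch of that approach as it might apply here (assuming span is a ddtrace span and status_code is the string extracted from the response earlier in _after): no traceback, just span.error plus the status-code tag.

def mark_5xx_as_error(span, status_code):
    # Mirrors the idea in ddtrace's trace_utils: the UI treats the span as an
    # error based on span.error and the http.status_code tag alone.
    if span is None or not status_code:
        return
    span.set_tag("http.status_code", status_code)
    if len(status_code) == 3 and status_code.startswith("5"):
        span.error = 1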

@@ -150,6 +153,7 @@ def __call__(self, event, context, **kwargs):
self._after(event, context)

def _before(self, event, context):
self.response = None
Collaborator

Let's put this line inside the try: ... block, so we avoid Datadog code crashing the customer application as much as we can. That is, Datadog should fail quietly (only log something) with no interruption to the customer application. I know this line is pretty safe, but it may "invite" future developers to add more lines outside the try block because they saw some lines outside...
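
A sketch of that structure (simplified; the logger and the elided setup steps are placeholders): everything in _before, including the reset, sits inside the try so a Datadog-side failure only gets logged.

import logging

logger = logging.getLogger(__name__)

class WrapperSketch(object):
    def _before(self, event, context):
        try:
            self.response = None
            # ... extract trigger tags, start spans, submit the invocation metric ...
        except Exception:
            logger.exception("datadog: _before failed, continuing with customer handler")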

self.span.set_traceback()
self.span.error = 1
self.span.set_tags({
    ERROR_TYPE: "5xx Server Errors",
Member

This is typically left for the Python class of the error.

e.g.

try:
    raise ValueError("oh no")
except Exception as e:
    error_type = type(e).__name__  # "ValueError"

Similar for ERROR_MSG, it is meant to be the traceback.

It might be better to avoid these just because the backend/UI is expecting specific values.

Do we also tag this span with the http.status_code?

Contributor Author

@kimi-p kimi-p Jun 23, 2022

Hey Brett,

I am new to the team so I might not be 100% correct, but I think we want to mark 5xx responses as "Error" even though the program does not throw any exception. I think the intended UI result looks like the first screenshot. So that's why I'm setting the error.type and error.msg here (according to the 2nd picture, which is from the Datadog doc).

[two screenshots: the intended UI result and the Datadog doc referenced above]

Do we also tag this span with the http.status_code?

Yes, we do here right before the logic I added:

self.trigger_tags["http.status_code"] = status_code

Member

Typically when we mark a span as an error because of the status code we don't set a traceback (i.e., we don't call span.set_traceback()), because there isn't an exception/traceback associated with the "failure".

It is OK to only set span.error = 1; as long as we also have the http.status_code tag set on the span (thanks for clarifying, I couldn't tell whether trigger_tags get added to the span or not), the UI knows what to do with it.

But this is probably more of a product decision: do we want to add a traceback for these specific 500 responses? Does it provide value for customers?

Member

(also 👋🏻 welcome!!)

Contributor Author

Regarding trigger_tags, I can't be 100% sure, but I think we do set it here:

if self.make_inferred_span:
    self.inferred_span = create_inferred_span(event, context)
self.span = create_function_execution_span(
    context,
    self.function_name,
    is_cold_start(),
    trace_context_source,
    self.merge_xray_traces,
    self.trigger_tags,

I will check with the team about the traceback. Thanks for the comment.

Contributor Author

Update: we're now only setting self.span.error = 1 and not setting error.type and error.msg. The frontend will handle it the same way it does for APM. The PR description has been updated to reflect the change, and the screenshots are also updated.

@kimi-p kimi-p marked this pull request as ready for review June 24, 2022 15:03
@kimi-p kimi-p requested a review from a team as a code owner June 24, 2022 15:03
@kimi-p kimi-p changed the title Send errors metrics for 5xx reponse from API Gateway, Lambda Function URL, or ALB Send errors metrics for 5xx response from API Gateway, Lambda Function URL, or ALB Jun 24, 2022
@@ -42,3 +42,13 @@ class XrayDaemon(object):
    XRAY_TRACE_ID_HEADER_NAME = "_X_AMZN_TRACE_ID"
    XRAY_DAEMON_ADDRESS = "AWS_XRAY_DAEMON_ADDRESS"
    FUNCTION_NAME_HEADER_NAME = "AWS_LAMBDA_FUNCTION_NAME"


SERVER_ERRORS_STATUS_CODES = {
Collaborator

you probably don't need them any longer.

Collaborator

@tianchu tianchu left a comment

💯

@kimi-p kimi-p merged commit ef9f18c into main Jun 27, 2022
@kimi-p kimi-p deleted the kimi/sls-1775.send_enhanced_errors_metrics_for_5xx_on_python_library branch June 27, 2022 19:32
kimi-p added a commit that referenced this pull request Jun 27, 2022