-
Notifications
You must be signed in to change notification settings - Fork 45
Send errors metrics for 5xx response from API Gateway, Lambda Function URL, or ALB #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Send errors metrics for 5xx response from API Gateway, Lambda Function URL, or ALB #229
Conversation
datadog_lambda/wrapper.py
Outdated
@@ -190,6 +192,9 @@ def _after(self, event, context): | |||
status_code = extract_http_status_code_tag(self.trigger_tags, self.response) | |||
if status_code: | |||
self.trigger_tags["http.status_code"] = status_code | |||
if not self.already_submitted_errors_metric_before: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better way to solve this problem is to reset self.response
to None
in _before
(or in _call
before the try
block). This way if the invocation fails, self.response
would still be None
and you won't have a status_code
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, in your current implementation already_submitted_errors_metric_before
never resets to False
once set True
. __init__
only gets called once on cold start. You need to reset it in _before
(or in _call
before the try
block)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. self.response
is already initialized to None
here:
self.response = None |
And when the exception happens, self.response
will still be None
. So, we actually do not need the flag (self. already_submitted_errors_metric_before
) to avoid sending errors metrics twice since the statusCode
will be None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's initialized to None, but it doesn't get reset to None between invocations. E.g., you have a good invocation, which would set self.response, then you have a failed invocation, self.response won't reset to None. extract_http_status_code_tag would extract the status code from the last good invocation instead of from the current failed invocation.
Even putting your use case aside, the state of self.response probably should be reset to None on before every invocation, keeping the invocation response value from the previous invocation in the current invocation is meaningless and can only lead to bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe one context you missed here is that Lambda isn't truly stateless, a sandbox/container is reused for many invocations until it gets recycled by AWS. https://pfisterer.dev/posts/aws-lambda-container-reuse might be a good read?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the post! Just want to be sure (since the post mentioned only about shared folder -- It's possible that a container is reused, which causes the /tmp folder to be shared between invocations.
), does the singleton wrapper got reused between multiple invocations?
datadog_lambda_wrapper = _LambdaDecorator
I plan to reset self.response
at the beginning of _before()
anyway. Thanks for calling this out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, a new container starts up on first invocation, your python app get loaded into memory, that's called cold start, then from there, the container will be reused for subsequent invocations, until AWS recycles it, or AWS needs to launch more containers to handle more requests (one lambda container can only handle one request/invocation at a time)
@@ -190,6 +190,10 @@ def _after(self, event, context): | |||
status_code = extract_http_status_code_tag(self.trigger_tags, self.response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replied to the original conversation, but want to point out, in case of invocation fails, self.response
would still hold the value from the previous good invocation, and you would end up emitting an error metric based on the previous invocation instead of the current one.
datadog_lambda/wrapper.py
Outdated
if len(status_code) == 3 and status_code.startswith("5"): | ||
submit_errors_metric(context) | ||
if self.span: | ||
self.span.set_traceback() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure you have a valid stacktrace in this case, since there isn't a real python exception being thrown. I suspect you need to directly set the error message and type fields https://github.com/DataDog/dd-trace-py/blob/fb8dfa2f33fff37d21df9728d8386c0260df9744/ddtrace/contrib/grpc/server_interceptor.py#L41-L42
We should set both fields to something meaningful that can help users understand the problem when seeing it. For error.type, we can use StatusCode 5xx
, for error.msg we can probably say Lambda invocation returns status code 500
(we can probably spell out the actual status code in the message instead of 5xx).
@brettlangdon we are trying to mark Lambda invocations returning statusCode 5xx as errors in trace, is what I mentioned above the best way to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what we do in the tracer: https://github.com/DataDog/dd-trace-py/blob/3b91b0da8/ddtrace/contrib/trace_utils.py#L274-L282
tl;dr; you just need to do self.span.error = 1
datadog_lambda/wrapper.py
Outdated
@@ -150,6 +153,7 @@ def __call__(self, event, context, **kwargs): | |||
self._after(event, context) | |||
|
|||
def _before(self, event, context): | |||
self.response = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put this line inside the try:...
block, so we avoid datadog code crashing customer application as much as we can....that is, datadog should fail quietly (only log something) but no interruption to customer application. I know this line is pretty safe, but it may "invite" future developers to add more lines outside the try block because they saw some lines outside...
datadog_lambda/wrapper.py
Outdated
self.span.set_traceback() | ||
self.span.error = 1 | ||
self.span.set_tags({ | ||
ERROR_TYPE: "5xx Server Errors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is typically left for the Python class of the error.
e.g.
try:
raise ValueError("oh no")
except Exception as e:
error_type = str(type(e)) # "ValueError"
Similar for ERROR_MSG, it is meant to be the traceback.
It might be better to avoid these just because the backend/UI is expecting specific values.
Do we also tag this span with the http.status_code
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Brett,
I am new to the team so might not be 100% correct, but I think we want to mark 5xx responses as "Error" even though the program does not throw any exception. The intended result UI I think looks like the first graph. So that's why I'm setting the error.type and error.msg here (according the 2nd picture from the datadog doc).
Do we also tag this span with the http.status_code ?
Yes, we do here right before the logic I added:
self.trigger_tags["http.status_code"] = status_code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically when we mark a span as an error because of the status code we don't set a traceback (e.g. don't we don't call span.set_traceback()
because there isn't an exception/traceback associated with the "failure").
it is ok to only set span.error = 1
, as long as we have the http.status_code
tag also set on the span (thanks for clarifying, I couldn't tell if trigger_tags
get added to the span or not) then the UI knows what to do with it.
But this is probably more of a product decision, do we want to add a traceback for these specific 500 responses? Do they provide value for customers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also 👋🏻 welcome!!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding trigger_tags
, can't be 100% sure but I think we did set it here:
datadog-lambda-python/datadog_lambda/wrapper.py
Lines 173 to 181 in a7cb6ef
if self.make_inferred_span: | |
self.inferred_span = create_inferred_span(event, context) | |
self.span = create_function_execution_span( | |
context, | |
self.function_name, | |
is_cold_start(), | |
trace_context_source, | |
self.merge_xray_traces, | |
self.trigger_tags, |
I will check with the team about the traceback. Thanks for the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: we're only setting self.error=1
and not setting the error.type
and error.msg
now. Frontend will handle like what APM does. The PR description is updated to reflect the change and screenshots are also updated.
datadog_lambda/constants.py
Outdated
@@ -42,3 +42,13 @@ class XrayDaemon(object): | |||
XRAY_TRACE_ID_HEADER_NAME = "_X_AMZN_TRACE_ID" | |||
XRAY_DAEMON_ADDRESS = "AWS_XRAY_DAEMON_ADDRESS" | |||
FUNCTION_NAME_HEADER_NAME = "AWS_LAMBDA_FUNCTION_NAME" | |||
|
|||
|
|||
SERVER_ERRORS_STATUS_CODES = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably don't need them any longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
What does this PR do?
aws.lambda.enhanced.errors
when the Lambda invocation is from AWS API Gateway, AWS Lambda Function URL, or ALB.span.error=1
.self.response
toNone
in_before
so that the response from last invocation wouldn't carry over into the current invocation.Motivation
When using a web framework, and something went wrong with a request, instead of raising an exception, the web framework typically returns a
5xx
response back to the client. From user's perspective, invocations returning a5xx
response code (aka, the value of the statusCode field from the response is5xx
) should be reported as errors in Datadog.Testing Guidelines
Additional Notes
To avoid sending duplicated
aws.lambda.enhanced.errors
metric,self.already_submitted_errors_metric_before
is added.Types of Changes
Check all that apply