-
Notifications
You must be signed in to change notification settings - Fork 14
Add metrics for expirationd stats #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I am not familiar with metrics. I'm asking for your help @DifferentialOrange . I see the comment in #100, I decide to clarify how critical: 1.1. counter vs gauge usage, I prefer to use gauge becase a task can be killed and started again with the same name. So in fact the values may decrease. In additional we can support metrics since 0.10.0 . Also, I have a couple of questions about the implementation: 2.1. Should we call collector:remove({name = task_name}) if the task is killed? I implemented as if the answers are And the last one: 3.1 Should we install I will be glad if you review the code of the draft pull request too. Also, I think it would be nice if the implementation looked like |
e4c34e6
to
f489c5b
Compare
I haven't reviewed the code yet, starting with answering your comment.
In fact, it is not a decrease: it's a combination of reset and follow-up increase. By concept, counters support reset and increases, thus this is valid behavior. It is also would be processed as needed by Prometheus/InfluxDB. So I prefer counters here since it is legal.
Well, it's quite the opposite. For example, see https://www.robustperception.io/on-the-naming-of-things/ You shouldn't use suffixes
I think it is better to remove it, yeah. It may be still legal to leave it as is and not clean up, but I think removing it will be better. 2.2. Should we call collectore:remove({name = task_name}) if an user call expirationd.cfg({metrics = false}) In crud metrics I destroy collectors. I think you should do it like this too (and unregister callbacks if there will be any).
No, I think it shouldn't be a requirement. Let's make it like in CRUD.
I'll do that a bit later.
Yes, I've seen this implementation. There are also Grafana dashboard panels for them (https://github.com/tarantool/tdg2/blob/8d6909edf0e07743c9666485f0d514a26a2dceaf/common/metrics/expirationd.lua). I haven't reviewed this one, but if I would, I would've asked to change several things. You may use anything from this implementation if it seems useful, but I don't think we should make this implementation as close to that one as possible. More likely it will be vice versa: when we make in-built expirationd metrics, TDG developers may replace their implememntation with ours, so we may fix possible flaws here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked through a code. I have left no comment because all of them are covered with our discussion.
b57d6c0
to
6c088b0
Compare
Thanks for the detailed answers. I hope I followed all advices. I have saved the old names for the collectors, but it should be ok now:
The pull request is ready for a review. |
cfg = setmetatable({}, { | ||
__index = cfg, | ||
__newindex = function() error("Use expirationd.cfg{} instead", 2) end, | ||
__call = expirationd_cfg, | ||
__serialize = function() return cfg end, | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it work as expected across cartridge's role reloads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also stash.lua from crud.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it will not. But we do not currently provide a cartridge role. I think it would be redundant to provide such functionality. It will be better to do (and to test) in #107 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit formal way to look on the problem: sure, we can reject such request till #107, but it does not mean that we have no usages in cartridge applications. I would even say the opposite: we have several internal users at least.
The only thing I'm afraid is something mysterious and weird: for example, if the statistics will disappear at all or if it will be counted twice. If the actual behaviour is like "the expirationd.cfg.metrics
configuration parameter may lie after reload, but reconfiguration always give a proper result", it should be okay till #107.
I'll leave it up to you to think on risks and decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be ok if an user write stop()/start() cartridge code in his role:
local function init()
expirationd.cfg({metrics = true})
end
local function stop()
expirationd.cfg({metrics = false})
end
It may be the reason why it should be disabled by default.
Anyway, there should be a possibility to test this before implementation. It requires an implementation of a cartridge role (at least a basic implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I'm afraid is something mysterious and weird: for example, if the statistics will disappear at all or if it will be counted twice.
It seems to me it works fine (we don't count the stats twice, we don't save an old callbacks and collectors between reloads) after a .reload(). It does not save previous stats, but I think it's expected now (we don't save previous tasks too).
myapp.router> metrics.collectors()
---
- &0
expirationd_restartscounter:
registry: &1
collectors: *0
label_pairs:
alias: router
callbacks:
'function: 0x422a9e48': true
'function: 0x420fc878': true
'function: 0x40914a78': true
'function: 0x415eed88': true
'function: 0x415f8d10': true
'function: 0x41778950': true
'function: 0x420ff4c0': true
'function: 0x4010eb48': true
'function: 0x41e8f988': true
'function: 0x42108a98': true
'function: 0x406705f0': true
'function: 0x41e91c68': true
'function: 0x407a3ff8': true
'function: 0x41128c98': true
'function: 0x415eacd0': true
'function: 0x41f0e0f8': true
help: expirationd task's a number of restarts
observations: []
name: expirationd_restarts
label_pairs: []
expirationd_expired_countcounter:
registry: *1
help: expirationd task's a number of expired tuples
observations: []
name: expirationd_expired_count
label_pairs: []
expirationd_checked_countcounter:
registry: *1
help: expirationd task's a number of checked tuples
observations: []
name: expirationd_checked_count
label_pairs: []
expirationd_working_timecounter:
registry: *1
help: expirationd task's operation time
observations: []
name: expirationd_working_time
label_pairs: []
...
myapp.router> require("cartridge.roles").reload()
---
- true
...
myapp.router> metrics = require('metrics')
---
...
myapp.router> metrics.collectors()
---
- []
...
myapp.router> expirationd = require('expirationd')
---
...
myapp.router> metrics.collectors()
---
- &0
expirationd_restartscounter:
registry: &1
collectors: *0
label_pairs:
alias: router
callbacks:
'function: 0x40edaaa8': true
'function: 0x41e742b0': true
'function: 0x41f34368': true
'function: 0x417fbdf0': true
'function: 0x4011ac50': true
'function: 0x4122d978': true
'function: 0x4163b250': true
'function: 0x41e7f960': true
'function: 0x41ec7b28': true
'function: 0x42129c10': true
'function: 0x403de478': true
'function: 0x41cf5330': true
'function: 0x40da7d20': true
'function: 0x417f1390': true
'function: 0x40120a28': true
'function: 0x4199ba00': true
help: expirationd task's a number of restarts
observations: []
name: expirationd_restarts
label_pairs: []
expirationd_expired_countcounter:
registry: *1
help: expirationd task's a number of expired tuples
observations: []
name: expirationd_expired_count
label_pairs: []
expirationd_checked_countcounter:
registry: *1
help: expirationd task's a number of checked tuples
observations: []
name: expirationd_checked_count
label_pairs: []
expirationd_working_timecounter:
registry: *1
help: expirationd task's operation time
observations: []
name: expirationd_working_time
label_pairs: []
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, cartrige creates myapp with metrics 0.12.0 by default. 0.13.0 changes the behavior. Would you like such workaroud in metrics_enable() call?
--- a/expirationd.lua
+++ b/expirationd.lua
@@ -94,6 +94,16 @@ local function metrics_enable()
),
}
+ -- Workaround for a cartridge role reload:
+ --
+ -- Metrics package does not lose observation after a cartridge reload since
+ -- 0.13.0. expirationd does not yet support the cartridge reload (it
+ -- requires saving and restarting tasks at least). So, we are acting here
+ -- as if all expirationd tasks have been killed: reset all collectors.
+ for _, c in pairs(metrics_collectors) do
+ metrics.registry:unregister(c.collector)
+ metrics.registry:register(c.collector)
+ end
+
We plan to support a cartridge role soon, so I think it will be ok as a temporary solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I had to go the way suggested by @Totktonada . It is required because metrics 0.13.0 saves callbacks (I think it is an error, tarantool/metrics#378 ) together with collectors.
For the full cleanup we need to unregister a callback too, but this is only possible if we save the callback in a stash.
expirationd.lua
Outdated
@@ -21,6 +22,7 @@ local function get_fiber_id(fiber) | |||
end | |||
|
|||
local task_list = {} | |||
local cfg = {metrics = false} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not enabled by default, when metrics
module is present, as in crud
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to enable additional features by a client code. But I don't know this case very well. If we usually enable the metrics collection by default, then I will do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misguide you, sorry. It is a bit different for crud
: a user should enable statistics explicitly (crud.cfg({stats = true})
), but 'local'
or 'metrics'
driver is choosen automatically depending on whether the statistics is enabled.
If we'll look on expirationd in those terms, our 'local'
driver is always enabled (it likely has low overhead). So we either enable 'metrics'
driver or not. Hm.
(Should we make <...>.cfg()
options similar across modules? In theory, the answer should be 'yes', but there are differences like this 'always enabled' local driver. Interesting.)
Back to the question. There are cons and pros. If a statistics is just available on the default metrics dashboard in any project without explicit enabling it module-by-module, it is convenient. OTOH, we can add only low overhead statistics this way.
So I would ask it as: whether the overhead is low enough to enable it by default? I guess the answer is 'yes'.
However I has no enough level of confidence to insist on it. We need Georgy's vote for quorum :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DifferentialOrange could you help us please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the overhead is definitely low since the only operation that is performed by metrics is a callback call (it's several mathematical and assignment actions once in a minute or something, depends on monitoring agents). So no one should be affected by this in terms of performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I enable it by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I have no objections.
I left several points to Oleg to consider and decide. The patchset has my ok disregarding of the decisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patches!
6738ee2
to
9c9e2f2
Compare
9c9e2f2
to
2110e46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for patches, Oleg.
LGTM
bd7bcfb
to
ffc0997
Compare
09bff4a
to
06e4b42
Compare
06e4b42
to
92179bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be ok.
AFAIU, you haven't added code reload and cartridge reload tests because expirationd doesn't support it yet? Or simple package reload should be ok? If it is, I think it's better to check that everything is alright with tests.
92179bc
to
fb221b1
Compare
I'll try to add a simple test with the cartridge reload. |
fb221b1
to
f4ea19d
Compare
Added. |
The patch adds the ability to export statistics to metrics >= 0.11.0. expirationd does not require the metrics package itself and tries to use an installed one.
It also adds a new API method expirationd.cfg({options}).
Closes #100