Skip to content

Add metrics for expirationd stats #111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 24, 2022
Merged

Conversation

oleg-jukovec
Copy link
Contributor

@oleg-jukovec oleg-jukovec commented Jun 17, 2022

The patch adds the ability to export statistics to metrics >= 0.11.0. expirationd does not require the metrics package itself and tries to use an installed one.

It also adds a new API method expirationd.cfg({options}).

Closes #100

@oleg-jukovec
Copy link
Contributor Author

oleg-jukovec commented Jun 17, 2022

I am not familiar with metrics. I'm asking for your help @DifferentialOrange .

I see the comment in #100, I decide to clarify how critical:

1.1. counter vs gauge usage, I prefer to use gauge becase a task can be killed and started again with the same name. So in fact the values may decrease. In additional we can support metrics since 0.10.0 .
1.2. Is _total naming required? I prefer to use existing naming, I think it will be more expected.

Also, I have a couple of questions about the implementation:

2.1. Should we call collector:remove({name = task_name}) if the task is killed?
2.2. Should we call collectore:remove({name = task_name}) if an user call expirationd.cfg({metrics = false})

I implemented as if the answers are No, but I'm not sure is it a correct behavior.

And the last one:

3.1 Should we install metrics as a requirement for expirationd? I implemented it in a way similar to crud and do not metrics into requirements.

I will be glad if you review the code of the draft pull request too.

Also, I think it would be nice if the implementation looked like
TDG2 implemenation.

@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch from e4c34e6 to f489c5b Compare June 19, 2022 15:44
@DifferentialOrange
Copy link
Member

I haven't reviewed the code yet, starting with answering your comment.

1.1. counter vs gauge usage, I prefer to use gauge becase a task can be killed and started again with the same name. So in fact the values may decrease. In additional we can support metrics since 0.10.0 .

In fact, it is not a decrease: it's a combination of reset and follow-up increase. By concept, counters support reset and increases, thus this is valid behavior. It is also would be processed as needed by Prometheus/InfluxDB. So I prefer counters here since it is legal.

1.2. Is _total naming required? I prefer to use existing naming, I think it will be more expected.

Well, it's quite the opposite. For example, see https://www.robustperception.io/on-the-naming-of-things/

You shouldn't use suffixes _total, _sum, _bucket explicitly. If you use summary or histogram collector, they will be added internally, because it's the Prometheus standard of handling such things up. You shouldn't use suffix _count with any other collector rather than counter. On the other hand, using _count with counter is not requires, it is optional. But in most cases it is better to name counter as *_count so it will be easier to understand by users.

2.1. Should we call collector:remove({name = task_name}) if the task is killed?

I think it is better to remove it, yeah. It may be still legal to leave it as is and not clean up, but I think removing it will be better.

2.2. Should we call collectore:remove({name = task_name}) if an user call expirationd.cfg({metrics = false})

In crud metrics I destroy collectors. I think you should do it like this too (and unregister callbacks if there will be any).
https://github.com/tarantool/crud/blob/8e0065233cfedf56f21c09a8c7c1e62a0d6e8dbb/crud/stats/metrics_registry.lua#L156

3.1 Should we install metrics as a requirement for expirationd? I implemented it in a way similar to crud and do not metrics into requirements.

No, I think it shouldn't be a requirement. Let's make it like in CRUD.

I will be glad if you review the code of the draft pull request too.

I'll do that a bit later.

Also, I think it would be nice if the implementation looked like TDG2 implemenation.

Yes, I've seen this implementation. There are also Grafana dashboard panels for them (https://github.com/tarantool/tdg2/blob/8d6909edf0e07743c9666485f0d514a26a2dceaf/common/metrics/expirationd.lua). I haven't reviewed this one, but if I would, I would've asked to change several things. You may use anything from this implementation if it seems useful, but I don't think we should make this implementation as close to that one as possible. More likely it will be vice versa: when we make in-built expirationd metrics, TDG developers may replace their implememntation with ours, so we may fix possible flaws here.

Copy link
Member

@DifferentialOrange DifferentialOrange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked through a code. I have left no comment because all of them are covered with our discussion.

@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch 3 times, most recently from b57d6c0 to 6c088b0 Compare June 20, 2022 12:21
@oleg-jukovec
Copy link
Contributor Author

Thanks for the detailed answers. I hope I followed all advices.

I have saved the old names for the collectors, but it should be ok now:

On the other hand, using _count with counter is not requires, it is optional. But in most cases it is better to name counter as *_count so it will be easier to understand by users.

The pull request is ready for a review.

@oleg-jukovec oleg-jukovec marked this pull request as ready for review June 20, 2022 12:30
Comment on lines +1087 to +1133
cfg = setmetatable({}, {
__index = cfg,
__newindex = function() error("Use expirationd.cfg{} instead", 2) end,
__call = expirationd_cfg,
__serialize = function() return cfg end,
}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it work as expected across cartridge's role reloads?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also stash.lua from crud.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will not. But we do not currently provide a cartridge role. I think it would be redundant to provide such functionality. It will be better to do (and to test) in #107 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit formal way to look on the problem: sure, we can reject such request till #107, but it does not mean that we have no usages in cartridge applications. I would even say the opposite: we have several internal users at least.

The only thing I'm afraid is something mysterious and weird: for example, if the statistics will disappear at all or if it will be counted twice. If the actual behaviour is like "the expirationd.cfg.metrics configuration parameter may lie after reload, but reconfiguration always give a proper result", it should be okay till #107.

I'll leave it up to you to think on risks and decide.

Copy link
Contributor Author

@oleg-jukovec oleg-jukovec Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be ok if an user write stop()/start() cartridge code in his role:

local function init()
    expirationd.cfg({metrics = true})
end

local function stop()
    expirationd.cfg({metrics = false})
end

It may be the reason why it should be disabled by default.

Anyway, there should be a possibility to test this before implementation. It requires an implementation of a cartridge role (at least a basic implementation).

Copy link
Contributor Author

@oleg-jukovec oleg-jukovec Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I'm afraid is something mysterious and weird: for example, if the statistics will disappear at all or if it will be counted twice.

It seems to me it works fine (we don't count the stats twice, we don't save an old callbacks and collectors between reloads) after a .reload(). It does not save previous stats, but I think it's expected now (we don't save previous tasks too).

myapp.router> metrics.collectors()
---
- &0
  expirationd_restartscounter:
    registry: &1
      collectors: *0
      label_pairs:
        alias: router
      callbacks:
        'function: 0x422a9e48': true
        'function: 0x420fc878': true
        'function: 0x40914a78': true
        'function: 0x415eed88': true
        'function: 0x415f8d10': true
        'function: 0x41778950': true
        'function: 0x420ff4c0': true
        'function: 0x4010eb48': true
        'function: 0x41e8f988': true
        'function: 0x42108a98': true
        'function: 0x406705f0': true
        'function: 0x41e91c68': true
        'function: 0x407a3ff8': true
        'function: 0x41128c98': true
        'function: 0x415eacd0': true
        'function: 0x41f0e0f8': true
    help: expirationd task's a number of restarts
    observations: []
    name: expirationd_restarts
    label_pairs: []
  expirationd_expired_countcounter:
    registry: *1
    help: expirationd task's a number of expired tuples
    observations: []
    name: expirationd_expired_count
    label_pairs: []
  expirationd_checked_countcounter:
    registry: *1
    help: expirationd task's a number of checked tuples
    observations: []
    name: expirationd_checked_count
    label_pairs: []
  expirationd_working_timecounter:
    registry: *1
    help: expirationd task's operation time
    observations: []
    name: expirationd_working_time
    label_pairs: []
...
myapp.router> require("cartridge.roles").reload()
---
- true
...

myapp.router> metrics = require('metrics')
---
...

myapp.router> metrics.collectors()
---
- []
...

myapp.router> expirationd = require('expirationd')
---
...

myapp.router> metrics.collectors()
---
- &0
  expirationd_restartscounter:
    registry: &1
      collectors: *0
      label_pairs:
        alias: router
      callbacks:
        'function: 0x40edaaa8': true
        'function: 0x41e742b0': true
        'function: 0x41f34368': true
        'function: 0x417fbdf0': true
        'function: 0x4011ac50': true
        'function: 0x4122d978': true
        'function: 0x4163b250': true
        'function: 0x41e7f960': true
        'function: 0x41ec7b28': true
        'function: 0x42129c10': true
        'function: 0x403de478': true
        'function: 0x41cf5330': true
        'function: 0x40da7d20': true
        'function: 0x417f1390': true
        'function: 0x40120a28': true
        'function: 0x4199ba00': true
    help: expirationd task's a number of restarts
    observations: []
    name: expirationd_restarts
    label_pairs: []
  expirationd_expired_countcounter:
    registry: *1
    help: expirationd task's a number of expired tuples
    observations: []
    name: expirationd_expired_count
    label_pairs: []
  expirationd_checked_countcounter:
    registry: *1
    help: expirationd task's a number of checked tuples
    observations: []
    name: expirationd_checked_count
    label_pairs: []
  expirationd_working_timecounter:
    registry: *1
    help: expirationd task's operation time
    observations: []
    name: expirationd_working_time
    label_pairs: []
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not save previous stats

metrics should support package reload without losing observations since 0.9.0 and cartridge reload without losing observations since 0.13.0

Copy link
Contributor Author

@oleg-jukovec oleg-jukovec Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, cartrige creates myapp with metrics 0.12.0 by default. 0.13.0 changes the behavior. Would you like such workaroud in metrics_enable() call?

--- a/expirationd.lua
+++ b/expirationd.lua
@@ -94,6 +94,16 @@ local function metrics_enable()
         ),
     }
 
+    -- Workaround for a cartridge role reload:
+    --
+    -- Metrics package does not lose observation after a cartridge reload since
+    -- 0.13.0. expirationd does not yet support the cartridge reload (it
+    -- requires saving and restarting tasks at least). So, we are acting here
+    -- as if all expirationd tasks have been killed: reset all collectors.
+    for _, c in pairs(metrics_collectors) do
+         metrics.registry:unregister(c.collector)
+         metrics.registry:register(c.collector)
+    end
+

We plan to support a cartridge role soon, so I think it will be ok as a temporary solution.

Copy link
Contributor Author

@oleg-jukovec oleg-jukovec Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I had to go the way suggested by @Totktonada . It is required because metrics 0.13.0 saves callbacks (I think it is an error, tarantool/metrics#378 ) together with collectors.

For the full cleanup we need to unregister a callback too, but this is only possible if we save the callback in a stash.

expirationd.lua Outdated
@@ -21,6 +22,7 @@ local function get_fiber_id(fiber)
end

local task_list = {}
local cfg = {metrics = false}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not enabled by default, when metrics module is present, as in crud?

Copy link
Contributor Author

@oleg-jukovec oleg-jukovec Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to enable additional features by a client code. But I don't know this case very well. If we usually enable the metrics collection by default, then I will do the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misguide you, sorry. It is a bit different for crud: a user should enable statistics explicitly (crud.cfg({stats = true})), but 'local' or 'metrics' driver is choosen automatically depending on whether the statistics is enabled.

If we'll look on expirationd in those terms, our 'local' driver is always enabled (it likely has low overhead). So we either enable 'metrics' driver or not. Hm.

(Should we make <...>.cfg() options similar across modules? In theory, the answer should be 'yes', but there are differences like this 'always enabled' local driver. Interesting.)

Back to the question. There are cons and pros. If a statistics is just available on the default metrics dashboard in any project without explicit enabling it module-by-module, it is convenient. OTOH, we can add only low overhead statistics this way.

So I would ask it as: whether the overhead is low enough to enable it by default? I guess the answer is 'yes'.

However I has no enough level of confidence to insist on it. We need Georgy's vote for quorum :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DifferentialOrange could you help us please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the overhead is definitely low since the only operation that is performed by metrics is a callback call (it's several mathematical and assignment actions once in a minute or something, depends on monitoring agents). So no one should be affected by this in terms of performance

Copy link
Contributor Author

@oleg-jukovec oleg-jukovec Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I enable it by default.

Copy link
Member

@Totktonada Totktonada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I have no objections.

I left several points to Oleg to consider and decide. The patchset has my ok disregarding of the decisions.

Copy link
Member

@ligurio ligurio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patches!

@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch 4 times, most recently from 6738ee2 to 9c9e2f2 Compare June 21, 2022 06:50
@oleg-jukovec oleg-jukovec requested a review from ligurio June 21, 2022 06:57
@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch from 9c9e2f2 to 2110e46 Compare June 21, 2022 07:04
Copy link
Member

@ligurio ligurio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for patches, Oleg.

LGTM

@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch 3 times, most recently from bd7bcfb to ffc0997 Compare June 21, 2022 10:23
@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch 3 times, most recently from 09bff4a to 06e4b42 Compare June 21, 2022 13:15
@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch from 06e4b42 to 92179bc Compare June 21, 2022 14:01
@oleg-jukovec oleg-jukovec modified the milestone: nnnnnnnnn nnnnnnnm Jun 21, 2022
@oleg-jukovec oleg-jukovec requested review from ligurio and DifferentialOrange and removed request for DifferentialOrange June 23, 2022 12:59
Copy link
Member

@DifferentialOrange DifferentialOrange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ok.

AFAIU, you haven't added code reload and cartridge reload tests because expirationd doesn't support it yet? Or simple package reload should be ok? If it is, I think it's better to check that everything is alright with tests.

@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch from 92179bc to fb221b1 Compare June 24, 2022 06:47
@oleg-jukovec
Copy link
Contributor Author

If it is, I think it's better to check that everything is alright with tests.

I'll try to add a simple test with the cartridge reload.

The patch adds the ability to export statistics to metrics >= 0.11.0.
expirationd does not require the metrics package itself and tries to
use an installed one.

It also adds a new API method expirationd.cfg({options}).

Part of #100
Into the GitHub workflow and the Makefile.

Closes #100
@oleg-jukovec oleg-jukovec force-pushed the oleg-jukovec/gh-100-metrics branch from fb221b1 to f4ea19d Compare June 24, 2022 09:58
@oleg-jukovec
Copy link
Contributor Author

I'll try to add a simple test with the cartridge reload.

Added.

@oleg-jukovec oleg-jukovec merged commit e68baf4 into master Jun 24, 2022
@oleg-jukovec oleg-jukovec deleted the oleg-jukovec/gh-100-metrics branch June 24, 2022 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metrics for expirationd stats
4 participants