## About the changes
We have many aggregation queries that run on a schedule:
f63496d47f/src/lib/metrics.ts (L714-L719)
These staticCounters are usually doing db query aggregations that
traverse tables and we run all of them in parallel:
f63496d47f/src/lib/metrics.ts (L410-L412)
This can add strain to the db. This PR suggests a way of handling these
queries in a more structured way, allowing us to run them sequentially
(therefore spreading the load):
f02fe87835/src/lib/metrics-gauge.ts (L38-L40)
As an additional benefit, we get both the gauge definition and the
queries in a single place:
f02fe87835/src/lib/metrics.ts (L131-L141)
This PR only tackles 1 metric, and it only focuses on gauges to gather
initial feedback. The plan is to migrate these metrics and eventually
incorporate more types (e.g. counters)
---------
Co-authored-by: Nuno Góis <github@nunogois.com>
Ideally `feature_lifecycle_stage_entered{stage="archived"}` would allow
me to see how many flags are archived per week.
It seems like the numbers for this is a bit off, and wanted to extend
our current `feature_toggle_update` counter with action details.
We are observing incorrect data in Prometheus, which is consistently
non-reproducible. After a restart, the issue does not occur, but if the
pods run for an extended period, they seem to enter a strange state
where the counters become entangled and start sharing arbitrary values
that are added to the counters.
For example, the `feature_lifecycle_stage_entered` counter gets an
arbitrary value, such as 12, added when `inc()` is called. The
`exceedsLimitErrorCounter` shows the same behavior, and the code
implementation is identical.
We also tested some existing `increase()` counters, and they do not
suffer from this issue.
All calls to `counter.labels(labels).inc(`) will be replaced by
`counter.increment()` to try to mitigate the issue.
This PR adds Grafana gauges for all the existing resource limits. The
primary purpose is to be able to use this in alerting. Secondarily, we
can also use it to get better insights into how many customers have
increased their limits, as well as how many people are approaching their
limit, regdardless of whether it's been increased or not.
## Discussion points
### Implementation
The first approach I took (in
87528b4c67),
was to add a new gauge for each resource limit. However, there's a lot
of boilerplate for it.
I thought doing it like this (the current implementation) would make it
easier. We should still be able to use the labelName to collate this in
Grafana, as far as I understand? As a bonus, we'd automatically get new
resource limits when we add them to the schema.
``` tsx
const resourceLimit = createGauge({
name: 'resource_limit',
help: 'The maximum number of resources allowed.',
labelNames: ['resource'],
});
// ...
for (const [resource, limit] of Object.entries(config.resourceLimits)) {
resourceLimit.labels({ resource }).set(limit);
}
```
That way, when checking the stats, we should be able to do something
like this:
``` promql
resource_limit{resource="constraintValues"}
```
### Do we need to reset gauges?
I noticed that we reset gauges before setting values in them all over
the place. I don't know if that's necessary. I'd like to get that double
clarified before merging this.
https://linear.app/unleash/issue/2-2501/adapt-origin-middleware-to-stop-logging-ui-requests-and-start
This adapts the new origin middleware to stop logging UI requests (too
noisy) and adds new Prometheus metrics.
<img width="745" alt="image"
src="https://github.com/user-attachments/assets/d0c7f51d-feb6-4ff5-b856-77661be3b5a9">
This should allow us to better analyze this data. If we see a lot of API
requests, we can dive into the logs for that instance and check the
logged data, like the user agent.
This PR adds some helper methods to make listening and emitting metric
events more strict in terms of types. I think it's a positive change
aligned with our scouting principle, but if you think it's complex and
does not belong here I'm happy with dropping it.
This PR adds prometheus metrics for when users attempt to exceed the
limits for a given resource.
The implementation sets up a second function exported from the
ExceedsLimitError file that records metrics and then throws the error.
This could also be a static method on the class, but I'm not sure that'd
be better.
This adds an extended metrics format to the metrics ingested by Unleash
and sent by running SDKs in the wild. Notably, we don't store this
information anywhere new in this PR, this is just streamed out to
Victoria metrics - the point of this project is insight, not analysis.
Two things to look out for in this PR:
- I've chosen to take extend the registration event and also send that
when we receive metrics. This means that the new data is received on
startup and on heartbeat. This takes us in the direction of collapsing
these two calls into one at a later point
- I've wrapped the existing metrics events in some "type safety", it
ain't much because we have 0 type safety on the event emitter so this
also has some if checks that look funny in TS that actually check if the
data shape is correct. Existing tests that check this are more or less
preserved
This PR adds metrics tracking for:
- "maxConstraintValues": the highest number of constraint values that
are in use
- "maxConstraintsPerStrategy": the highest number of constraints used on
a strategy
It updates the existing feature strategy read model that returns max
metrics for other strategy-related things.
It also moves one test into a more fitting describe block.
Adds a postgres_version gauge to allow us to see postgres_version in
prometheus and to post it upstream when version checking. Depends on
https://github.com/bricks-software/version-function/pull/20 to be merged
first to ensure our version-function doesn't crash when given the
postgres-version data.
Now we are also sending project id to prometheus, also querying from
database. This sets us up for grafana dashboard.
Also put the metrics behind flag, just incase it causes cpu/memory
issues.
## About the changes
This PR provides a service that allows a scheduled function to run in a
single instance. It's currently not in use but tests show how to wrap a
function to make it single-instance:
65b7080e05/src/lib/features/scheduler/job-service.test.ts (L26-L32)
The key `'test'` is used to identify the group and most likely should
have the same name as the scheduled job.
---------
Co-authored-by: Christopher Kolstad <chriswk@getunleash.io>
This PR adds a counter in Prometheus for counting the number of
"environment disabled" events we get per project. The purpose of this is
to establish a baseline for one of the "project management UI" project's
key results.
## On gauges vs counters
This PR uses a counter. Using a gauge would give you the total number of
envs disabled, not the number of disable events. The difference is
subtle, but important.
For projects that were created before the new feature, the gauge might
be appropriate. Because each disabled env would require at least one
disabled event, we can get a floor of how many events were triggered for
each project.
However, for projects created after we introduce the planned change,
we're not interested in the total envs anymore, because you can disable
a hundred envs on creation with a single action. In this case, a gauge
showing 100 disabled envs would be misleading, because it didn't take
100 events to disable them.
So the interesting metric here is how many times did you specifically
disable an environment in project settings, hence the counter.
## Assumptions and future plans
To make this easier on ourselves, we make the follow assumption: people
primarily disable envs **when creating a project**.
This means that there might be a few lagging indicators granting some
projects a smaller number of events than expected, but we may be able to
filter those out.
Further, if we had a metric for each project and its creation date, we
could correlate that with the metrics to answer the question "how many
envs do people disable in the first week? Two weeks? A month?". Or
worded differently: after creating a project, how long does it take for
people to configure environments?
Similarly, if we gather that data, it will also make filtering out the
number of events for projects created **after** the new changes have
been released much easier.
The good news: Because the project creation metric with dates is a
static aggregate, it can be applied at any time, even retroactively, to
see the effects.