# Summary
Add optional lazy collection with TTL to our createGauge wrapper,
allowing a gauge to fetch its value on scrape and cache it for a
configurable duration. This lets us register a collect function directly
at gauge declaration without changing existing call sites or behavior.
We're experimenting with this approach, which is why we're only applying it
to `users_total`; we'll evaluate afterwards.
# Problem
- Some gauges should be computed on scrape (e.g., expensive or external
lookups) instead of being pushed continuously.
- Our current `createGauge` helper doesn’t make it easy to attach a
`collect` with caching. Each caller has to reimplement timing, caching,
and error handling.
- This leads to repeated costly work, inconsistent handling of unknown
values, and boilerplate.
# What changed
- `createGauge` now accepts two optional options in addition to the usual prom-client options:
  - `fetchValue?: () => Promise<number | null>`
  - `ttlMs?: number`
- When `fetchValue` is provided:
  - We install a `collect` that fetches on scrape.
  - Successful values are cached for `ttlMs` milliseconds (if `ttlMs` > 0).
  - If `ttlMs` is 0 or omitted, we fetch on every scrape.
  - If `fetchValue` returns null or throws, we set the gauge to `NaN` (indicates `"unknown"`); see the usage sketch below.
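As a usage sketch, declaring `users_total` (mentioned in the summary) as a lazily collected gauge could look like the following; the store and query helper are assumptions for illustration, only `fetchValue` and `ttlMs` are the new options:
``` ts
// Hypothetical usage sketch: `userStore.count()` is an assumed query helper;
// `fetchValue` and `ttlMs` are the new createGauge options described above.
const usersTotal = createGauge({
    name: 'users_total',
    help: 'Total number of users.',
    fetchValue: async () => {
        // Expensive aggregation that should only run on scrape.
        const count = await userStore.count();
        return count ?? null; // null means "unknown" and sets the gauge to NaN
    },
    ttlMs: 60_000, // cache a successful value for one minute
});
```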
# Behavior details
## Caching:
- A value is “fresh” when successfully fetched within `ttlMs`.
- Only numeric successes are cached. null and errors are not cached;
we’ll refetch on the next scrape.
## Unknown values:
- null or thrown errors set the gauge to `NaN` so Prometheus won’t treat
it as zero.
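A minimal sketch of the caching and unknown-value behavior, assuming the wrapper closes over `gauge`, `fetchValue`, and `ttlMs`; variable names are illustrative, not the actual implementation:
``` ts
// Sketch only: fresh cached values are reused, everything else refetches,
// and null/errors surface as NaN ("unknown").
let cachedValue: number | undefined;
let cachedAt = 0;

const collect = async () => {
    const fresh =
        cachedValue !== undefined && ttlMs > 0 && Date.now() - cachedAt < ttlMs;
    if (!fresh) {
        try {
            const value = await fetchValue();
            // Only numeric successes are cached; null falls through to NaN.
            cachedValue = value ?? undefined;
            if (value !== null) cachedAt = Date.now();
        } catch {
            cachedValue = undefined; // errors are not cached either
        }
    }
    gauge.set(cachedValue ?? NaN);
};
```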
## Compatibility:
- Backward compatible. Existing uses of `createGauge` are unchanged.
If a user-supplied `collect` exists, it still runs after the TTL logic
(can overwrite the value by design).
- API remains the same for the returned wrapper: `{ gauge, labels,
reset, set }`.
https://linear.app/unleash/issue/2-3738/clear-unknown-flags-every-24h-instead-of-every-7d
Clears unknown flags every 24h instead of every 7d.
This ensures the list stays more relevant by removing stale entries
sooner, allowing users to focus on actively reported unknown flags.
Also includes small improvements, such as a new paragraph on the unknown
flags page that better explains the concept of unknown flag reports.
**BREAKING CHANGE**: `DEFAULT_ENV` changed from `default` (which should no
longer be used) to `development`.
## About the changes
- Only delete the `default` environment if the installation is brand new.
- Consider `development` the new default. The main consequence of this
change is that the default environment is no longer a `type=production`
environment; this also affects frontend tokens due to this assumption:
724c4b78a2/src/lib/schema/api-token-schema.test.ts (L54-L59)
(I believe this is mostly due to the [support for admin
tokens](https://github.com/Unleash/unleash/pull/10080#discussion_r2126871567))
- The `feature_toggle_update_total` metric reports `n/a` for environment and
environment type, as it is not environment-specific.
We're migrating to ESM, which will allow us to import the latest
versions of our dependencies.
Co-Authored-By: Christopher Kolstad <chriswk@getunleash.io>
https://linear.app/unleash/issue/2-3406/hold-unknown-flags-in-memory-and-show-them-in-the-ui-somehow
This PR introduces a suggestion for an “unknown flags” feature.
When clients report metrics for flags that don’t exist in Unleash (e.g.
due to typos), we now track a limited set of these unknown flag names
along with the appnames that reported them. The goal is to help users
identify and clean up incorrect flag usage across their apps.
We store up to 10 unknown flag + appName combinations, keeping only the
most recent reports. Data is collected in-memory and flushed
periodically to the DB, with deduplication and merging to ensure we
don’t exceed the cap even across pods.
We were especially careful to make this implementation defensive, as
unknown flags could be reported in very high volumes. Writes are
batched, deduplicated, and hard-capped to avoid DB pressure.
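As an illustration of the dedupe-and-cap approach, a hypothetical sketch (class, method, and constant names are not the actual implementation):
``` ts
// Hypothetical sketch: dedupe by flag + app, keep only the most recent
// reports, and hard-cap the batch before it is flushed to the DB.
const MAX_UNKNOWN_FLAGS = 10;

type UnknownFlagReport = { flagName: string; appName: string; seenAt: Date };

class UnknownFlagsCollector {
    private reports = new Map<string, UnknownFlagReport>();

    record(flagName: string, appName: string): void {
        // A repeated flag + app combination just refreshes its timestamp.
        this.reports.set(`${flagName}:${appName}`, {
            flagName,
            appName,
            seenAt: new Date(),
        });
    }

    // Called periodically; returns at most MAX_UNKNOWN_FLAGS of the newest entries.
    flush(): UnknownFlagReport[] {
        const batch = [...this.reports.values()]
            .sort((a, b) => b.seenAt.getTime() - a.seenAt.getTime())
            .slice(0, MAX_UNKNOWN_FLAGS);
        this.reports.clear();
        return batch;
    }
}
```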
No UI has been added yet — this is backend-only for now and intended as
a step toward better visibility into client misconfigurations.
I would suggest starting with a simple banner that opens a dialog
showing the list of unknown flags and which apps reported them.
<img width="497" alt="image"
src="https://github.com/user-attachments/assets/b7348e0d-0163-4be4-a7f8-c072e8464331"
/>
As part of preparation for ESM and node/TSC updates, this PR will make
Unleash build with strictNullChecks set to true, since that's what's in
our tsconfig file. Hence, this PR also removes the `--strictNullChecks
false` flag in our compile tasks in package.json.
TL;DR - Clean up your code rather than turning off compiler security
features :)
When there is a new revision, we will start storing the memory footprint of
the old client-api and the new delta-api.
We will report it as Prometheus metrics.
The memory size will only be recalculated when the revision changes, which
does not happen very often.
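A rough sketch of what this could look like; the gauge name, label values, and the size helper are assumptions:
``` ts
// Hypothetical sketch: recompute the in-memory footprint only when the
// revision id changes, and expose it per API as a Prometheus gauge.
const apiMemoryFootprint = createGauge({
    name: 'api_memory_footprint_bytes', // metric name is an assumption
    help: 'Approximate in-memory size of the client-api and delta-api caches.',
    labelNames: ['api'],
});

let lastRevisionId = -1;

const onRevisionUpdate = (revisionId: number) => {
    if (revisionId === lastRevisionId) return; // revisions change rarely
    lastRevisionId = revisionId;
    apiMemoryFootprint
        .labels({ api: 'client-api' })
        .set(approximateSize(clientApiCache)); // size helper + caches assumed
    apiMemoryFootprint
        .labels({ api: 'delta-api' })
        .set(approximateSize(deltaApiCache));
};
```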
## About the changes
We have many aggregation queries that run on a schedule:
f63496d47f/src/lib/metrics.ts (L714-L719)
These staticCounters usually perform DB aggregation queries that traverse
tables, and we run all of them in parallel:
f63496d47f/src/lib/metrics.ts (L410-L412)
This can add strain to the DB. This PR suggests a more structured way of
handling these queries, allowing us to run them sequentially (and therefore
spread the load):
f02fe87835/src/lib/metrics-gauge.ts (L38-L40)
As an additional benefit, we get both the gauge definition and the
queries in a single place:
f02fe87835/src/lib/metrics.ts (L131-L141)
This PR only tackles one metric and focuses on gauges to gather initial
feedback. The plan is to migrate these metrics and eventually incorporate
more types (e.g. counters).
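For illustration, the gist of the structured approach could look like the sketch below; the class and method names are assumptions, and the referenced lines in `metrics-gauge.ts` hold the real implementation:
``` ts
import type { Gauge } from 'prom-client';

// Illustrative sketch: register the gauge definition and its DB query
// together, then refresh them sequentially to spread the load.
type GaugeDbMetric = { gauge: Gauge; query: () => Promise<number> };

class DbMetricsMonitor {
    private tasks: GaugeDbMetric[] = [];

    registerGaugeDbMetric(def: {
        name: string;
        help: string;
        query: () => Promise<number>;
    }): void {
        // createGauge is the wrapper discussed above.
        const { gauge } = createGauge({ name: def.name, help: def.help });
        this.tasks.push({ gauge, query: def.query });
    }

    // Run the aggregation queries one at a time instead of in parallel.
    async refreshDbMetrics(): Promise<void> {
        for (const { gauge, query } of this.tasks) {
            gauge.set(await query());
        }
    }
}
```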
---------
Co-authored-by: Nuno Góis <github@nunogois.com>
Ideally `feature_lifecycle_stage_entered{stage="archived"}` would allow
me to see how many flags are archived per week.
It seems like the numbers for this are a bit off, so I wanted to extend
our current `feature_toggle_update` counter with action details.
We are observing incorrect data in Prometheus, which is consistently
non-reproducible. After a restart, the issue does not occur, but if the
pods run for an extended period, they seem to enter a strange state
where the counters become entangled and start sharing arbitrary values
that are added to the counters.
For example, the `feature_lifecycle_stage_entered` counter gets an
arbitrary value, such as 12, added when `inc()` is called. The
`exceedsLimitErrorCounter` shows the same behavior, and the code
implementation is identical.
We also tested some existing `increase()` counters, and they do not
suffer from this issue.
All calls to `counter.labels(labels).inc()` will be replaced by
`counter.increment()` to try to mitigate the issue.
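For illustration, the mitigation amounts to something like the following; the counter variable is illustrative, and whether `increment` takes the labels directly is an assumption here:
``` ts
// Before: chained label lookup on the underlying prom-client counter.
featureToggleUpdateTotal.labels(labels).inc();

// After: go through the wrapper's increment method instead (signature assumed).
featureToggleUpdateTotal.increment(labels);
```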
This PR adds Grafana gauges for all the existing resource limits. The
primary purpose is to be able to use this in alerting. Secondarily, we
can also use it to get better insights into how many customers have
increased their limits, as well as how many people are approaching their
limit, regardless of whether it's been increased or not.
## Discussion points
### Implementation
The first approach I took (in
87528b4c67)
was to add a new gauge for each resource limit. However, there's a lot
of boilerplate for it.
I thought doing it like this (the current implementation) would make it
easier. We should still be able to use the labelName to collate this in
Grafana, as far as I understand? As a bonus, we'd automatically get new
resource limits when we add them to the schema.
``` tsx
const resourceLimit = createGauge({
    name: 'resource_limit',
    help: 'The maximum number of resources allowed.',
    labelNames: ['resource'],
});

// ...

for (const [resource, limit] of Object.entries(config.resourceLimits)) {
    resourceLimit.labels({ resource }).set(limit);
}
```
That way, when checking the stats, we should be able to do something
like this:
``` promql
resource_limit{resource="constraintValues"}
```
### Do we need to reset gauges?
I noticed that we reset gauges before setting values in them all over
the place. I don't know if that's necessary, and I'd like to get it
double-checked before merging this.
https://linear.app/unleash/issue/2-2501/adapt-origin-middleware-to-stop-logging-ui-requests-and-start
This adapts the new origin middleware to stop logging UI requests (too
noisy) and adds new Prometheus metrics.
<img width="745" alt="image"
src="https://github.com/user-attachments/assets/d0c7f51d-feb6-4ff5-b856-77661be3b5a9">
This should allow us to better analyze this data. If we see a lot of API
requests, we can dive into the logs for that instance and check the
logged data, like the user agent.
This PR adds some helper methods to make listening for and emitting metric
events more strictly typed. I think it's a positive change aligned with our
scouting principle, but if you think it's too complex or doesn't belong
here, I'm happy to drop it.
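A small sketch of what stricter typing could mean here; the event name and payload shape are made up for illustration:
``` ts
import type { EventEmitter } from 'events';

// Hypothetical sketch: map each metric event name to its payload type so
// that both emitting and listening are checked by the compiler.
type MetricEventPayloads = {
    REQUEST_ORIGIN: { type: 'UI' | 'API'; userAgent?: string };
};
type MetricEvent = keyof MetricEventPayloads;

export const emitMetricEvent = <T extends MetricEvent>(
    eventBus: EventEmitter,
    event: T,
    payload: MetricEventPayloads[T],
): void => {
    eventBus.emit(event, payload);
};

export const onMetricEvent = <T extends MetricEvent>(
    eventBus: EventEmitter,
    event: T,
    handler: (payload: MetricEventPayloads[T]) => void,
): void => {
    eventBus.on(event, handler);
};
```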
This PR adds Prometheus metrics for when users attempt to exceed the
limits for a given resource.
The implementation sets up a second function exported from the
ExceedsLimitError file that records metrics and then throws the error.
This could also be a static method on the class, but I'm not sure that'd
be better.
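A hedged sketch of the described shape; the function name, event name, and error message are assumptions based on the description, not the actual code:
``` ts
import type { EventEmitter } from 'events';

// Hypothetical sketch: record a metric event, then throw, so callers get
// both behaviors from a single call.
export class ExceedsLimitError extends Error {
    constructor(
        public readonly resource: string,
        public readonly limit: number,
    ) {
        super(`Exceeded limit for ${resource} (limit: ${limit}).`); // message assumed
        this.name = 'ExceedsLimitError';
    }
}

export const throwExceedsLimitError = (
    eventBus: EventEmitter,
    { resource, limit }: { resource: string; limit: number },
): never => {
    eventBus.emit('exceeds-limit', { resource, limit }); // event name assumed
    throw new ExceedsLimitError(resource, limit);
};
```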